[lxml-dev] Question on clean_html

Ian Bicking ianb at colorstudy.com
Mon Dec 1 21:32:57 CET 2008


Brian Neal wrote:
> Hi,
> 
> I would like to use lxml to remove all tags except 'a' tags. Is this possible?
> 
> I don't seem to understand the arguments to the Cleaner class. What
> does allow_tags do?
> 
> I tried this:
> 
>>>> c = Cleaner(allow_tags=('a',), remove_unknown_tags=False)
>>>> print c.clean_html('<b>Hi</b>')
> <b>Hi</b>
> 
> Do I instead have to list all the tags I don't want, except for 'a',
> in a remove_tags keyword argument?
> 
> Any hints? Thank you.

There's not really a way to do this with the Cleaner I'm afraid. 
(Hrm... I really need to clean up the options there, as they overlap in 
lots of weird ways and are confusing.)

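The .drop_tag() method on lxml.html elements could help here, like (untested):

for el in list(doc.iter()):
    # drop_tag() needs a parent element, so leave the root alone
    if el.tag != 'a' and el.getparent() is not None:
        el.drop_tag()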

I'm not 100% sure what happens if you modify the tree in place like 
this, though I think list() will make it work.

-- 
Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org

