[lxml-dev] Question on clean_html
Ian Bicking
ianb at colorstudy.com
Mon Dec 1 21:32:57 CET 2008
Brian Neal wrote:
> Hi,
>
> I would like to use lxml to remove all tags except 'a' tags. Is this possible?
>
> I don't seem to understand the arguments to the Cleaner class. What
> does allow_tags do?
>
> I tried this:
>
> >>> c = Cleaner(allow_tags=('a',), remove_unknown_tags=False)
> >>> print c.clean_html('<b>Hi</b>')
> <b>Hi</b>
>
> Do I instead have to list all the tags I don't want, except for 'a',
> in a remove_tags keyword argument?
>
> Any hints? Thank you.
There's not really a way to do this with the Cleaner I'm afraid.
(Hrm... I really need to clean up the options there, as they overlap in
lots of weird ways and are confusing.)
The method .drop_tag could help here, like (untested):
    for el in list(doc.iter()):
        if el.tag not in ['a']:
            el.drop_tag()
I'm not 100% sure what happens if you modify the tree in place like
this, though I think list() will make it work.
--
Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org