[lxml-dev] Controlling attributes in lxml.html.clean

Stefan Behnel stefan_ml at behnel.de
Fri Jul 18 08:02:44 CEST 2008


Hi,

Bruno Barberi Gnecco wrote:
> Is there some way to control tag attributes in lxml.html.clean? Specifically I'm
> trying to get rid of any 'id' attributes. It seems that the only control
> available is the safe_attrs_only flag, which only allows defs.safe_attrs to be
> used.

We don't currently support an attribute whitelist.


> May I suggest an API somewhat like this:
> http://amisphere.com/contrib/python-html-filter/ for the next releases? I'd be
> happy to collaborate to implement it.

That's a bit simplistic, though.

What I would like to see is a dict that maps all HTML tag names to a list of
attribute names allowed on them in HTML(5?). That would fit into defs.py.

It should only handle tags that are in the dict, and not try to also remove
tags that are missing. That way, users could just create a dict with the tags
they want to see treated.
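
A minimal sketch of that behaviour, written against the standard library's
xml.etree.ElementTree so that it is self-contained. The names ALLOWED_ATTRS
and strip_unlisted_attributes are hypothetical, and the dict entries are
illustrative only; nothing here exists in lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical whitelist: tag name -> attribute names allowed on that tag.
# The entries are illustrative only, not taken from lxml's defs.py.
ALLOWED_ATTRS = {
    "a": {"href", "title"},
    "img": {"src", "alt"},
}


def strip_unlisted_attributes(root, allowed):
    """Remove attributes that are not whitelisted for their tag.

    Elements whose tag is missing from ``allowed`` are left untouched.
    """
    for element in root.iter():
        permitted = allowed.get(element.tag)
        if permitted is None:
            continue  # tag is not in the dict, so it is not treated
        # Copy the attribute names first, since we delete while looping.
        for name in list(element.attrib):
            if name not in permitted:
                del element.attrib[name]
    return root


doc = ET.fromstring(
    '<div id="main"><a id="x" href="/home" onclick="evil()">home</a></div>'
)
strip_unlisted_attributes(doc, ALLOWED_ATTRS)
link = doc.find("a")
```

Here the 'a' element keeps only its href, while the 'div' keeps its id
because 'div' is not a key in the dict.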

I'm not sure about the rest of the API. Users could copy the dict in defs and
remove things they don't want, and then pass it to the Cleaner to have it
remove every attribute that's not in there. Or maybe a separate filter
function would make sense, as this is generally applicable to all XML.
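
The "copy the dict and remove things" step might look roughly like this.
DEFAULT_ALLOWED_ATTRS is a hypothetical stand-in for a dict that could live
in defs.py, and without_attribute is an assumed helper name, not an existing
lxml function.

```python
import copy

# Hypothetical stand-in for a tag -> allowed-attributes dict in defs.py.
DEFAULT_ALLOWED_ATTRS = {
    "a": {"href", "title", "id"},
    "img": {"src", "alt", "id"},
}


def without_attribute(allowed, name):
    """Return a copy of ``allowed`` with ``name`` dropped from every tag."""
    result = copy.deepcopy(allowed)  # leave the shared default untouched
    for attrs in result.values():
        attrs.discard(name)
    return result


# A user who wants to get rid of 'id' everywhere derives their own dict.
no_ids = without_attribute(DEFAULT_ALLOWED_ATTRS, "id")
```

The deep copy matters because the per-tag sets are mutable; editing them in
place would change the defaults for every other user of the module.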

It would be nice if you could come up with something like that. (Don't forget
the doctests!)

Stefan

