[lxml-dev] XHTML handling in lxml.html

Stefan Behnel stefan_ml at behnel.de
Sat Mar 1 09:49:25 CET 2008


Ian Bicking wrote:
> translating HTML to XHTML is kind of an outstanding issue for lxml.html,
> and it seems reasonable to me that XHTML could be parsed into the same
> classes as HTML.  The only real caveat there is that XHTML uses different
> (namespaced) tag names.

I agree that there is more we could do. For example, we could add "xhtml" as a
serialisation method that internally adds a namespace declaration to the
serialised "<html>" (iff no namespace is declared already). I'm not sure
whether it should be an error if the tree contains non-HTML elements; I guess
we could just leave that to the user.
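
To make the idea concrete, here is a rough sketch of what such a step could
do. It is not an existing lxml feature: the helper names are made up for
illustration, and it uses the standard library's ElementTree instead of lxml
so that it runs without extra dependencies. Both libraries spell namespaced
tags in the same "{uri}name" form.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def add_xhtml_namespace(root):
    """Hypothetical helper: move namespace-free elements into the XHTML namespace.

    Elements that already carry a namespace are left alone, which is the
    "leave that to the user" policy for non-HTML content.
    """
    for el in root.iter():
        # Comments and processing instructions have a non-string tag.
        if isinstance(el.tag, str) and not el.tag.startswith("{"):
            el.tag = "{%s}%s" % (XHTML_NS, el.tag)
    return root


def to_xhtml_string(root):
    """Hypothetical "xhtml" serialisation: namespace the tree, then serialise."""
    # An empty prefix makes ElementTree emit a default xmlns declaration.
    # Note that this registration is global to the ElementTree module.
    ET.register_namespace("", XHTML_NS)
    return ET.tostring(add_xhtml_namespace(root), encoding="unicode")


root = ET.fromstring("<html><body><p>Hello</p></body></html>")
xhtml = to_xhtml_string(root)
print(xhtml)
```

Note that the sketch modifies the tree in place; a real serialisation method
would more likely leave the tree untouched and only change the output.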


> If you remove the tag names, then the classes and
> the lookup applies just fine.  (Presumably the lookup could be changed to
> support XHTML fairly easily.)

I would say so, yes. I assume there would also be issues with the XPath
expressions in things like html.clean, since those presumably match
namespace-free tag names. It would definitely be a good thing if the whole
machinery could handle namespace-free HTML and namespaced XHTML equally well.
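
The XPath problem can be illustrated with a small sketch. Again, this uses
the standard library's ElementTree rather than lxml, and find_links() is a
made-up helper, not something from html.clean. It only shows that a plain
".//a" path finds nothing in an XHTML tree, and one possible way to cover
both cases.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def find_links(root):
    """Hypothetical helper: find <a> elements in HTML and XHTML trees alike."""
    # Matches only namespace-free <a> elements.
    plain = root.findall(".//a")
    # Matches only <a> elements in the XHTML namespace.
    namespaced = root.findall(".//x:a", {"x": XHTML_NS})
    # In a tree that mixes both kinds, this is not in document order.
    return plain + namespaced


html_doc = ET.fromstring(
    '<html><body><a href="/one">1</a><a href="/two">2</a></body></html>'
)
xhtml_doc = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><a href="/one">1</a><a href="/two">2</a></body></html>'
)

# The namespace-free path silently misses every link in the XHTML tree.
plain_only = xhtml_doc.findall(".//a")

html_hrefs = [a.get("href") for a in find_links(html_doc)]
xhtml_hrefs = [a.get("href") for a in find_links(xhtml_doc)]
print(plain_only, html_hrefs, xhtml_hrefs)
```

Running every query twice is the simplest approach, but it doubles the work;
rewriting the expressions once per document type would be an alternative.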

Stefan

