[lxml-dev] XHTML handling in lxml.html
Stefan Behnel
stefan_ml at behnel.de
Sat Mar 1 09:49:25 CET 2008
Ian Bicking wrote:
> translating HTML to XHTML is kind of an outstanding issue for lxml.html,
> and it seems reasonable to me that XHTML could be parsed into the same
> classes as HTML. The only real caveat there is that XHTML uses different
> (namespaced) tag names.
I agree that there is more we could do. For example, we could add "xhtml" as a
serialisation method that internally adds the XHTML namespace declaration to
the serialised "<html>" element (iff no namespace is declared already). I'm
not sure whether it should be an error if the tree contains non-HTML elements;
I guess we could just leave that to the user.
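
A rough sketch of the idea, not an existing lxml feature: it uses the stdlib
xml.etree.ElementTree API (which lxml.etree largely mirrors) so that it runs
without extra dependencies. The helper name to_xhtml_string is hypothetical,
and it rewrites the tags in the tree instead of adding a real "xhtml"
serialisation method. Elements that already carry a namespace are left alone.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

def to_xhtml_string(root):
    """Hypothetical helper: serialise a namespace-free HTML tree as XHTML."""
    for el in root.iter():
        # Only touch real elements that have no namespace yet; anything
        # already namespaced (e.g. embedded SVG) is left to the user.
        if isinstance(el.tag, str) and not el.tag.startswith("{"):
            el.tag = "{%s}%s" % (XHTML_NS, el.tag)
    # Map the XHTML namespace to the empty prefix so that it is written as
    # a default xmlns declaration. Note: this changes a global registry.
    ET.register_namespace("", XHTML_NS)
    return ET.tostring(root, encoding="unicode")

root = ET.fromstring("<html><body><p>Hi</p></body></html>")
result = to_xhtml_string(root)
print(result)
```

The declaration ends up on the root "<html>" element only, which is what the
proposed serialisation method would produce as well.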
> If you remove the tag names, then the classes and
> the lookup applies just fine. (Presumably the lookup could be changed to
> support XHTML fairly easily.)
I would say so, yes. There would also be issues with the XPath expressions in
things like html.clean, I assume. It would definitely be a good thing if the
whole machinery could handle namespace-free HTML and namespaced XHTML equally
well.
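
To illustrate what "equally well" could mean for lookups, here is a minimal
sketch, again with the stdlib ElementTree API and not taken from lxml.html
itself. The helper iter_html is a hypothetical name; it matches an element by
its plain HTML tag name or by the same name in the XHTML namespace.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

def iter_html(root, name):
    """Hypothetical helper: yield elements called `name`, whether they are
    namespace-free HTML or namespaced XHTML."""
    wanted = {name, "{%s}%s" % (XHTML_NS, name)}
    for el in root.iter():
        if el.tag in wanted:
            yield el

plain = ET.fromstring(
    "<html><body><a href='x'/><a href='y'/></body></html>")
xhtml = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><a href="z"/></body></html>')

plain_links = [a.get("href") for a in iter_html(plain, "a")]
xhtml_links = [a.get("href") for a in iter_html(xhtml, "a")]
print(plain_links, xhtml_links)
```

The same two-name matching would have to be applied to the element class
lookup and to the XPath expressions used in html.clean.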
Stefan