[lxml-dev] XHTML handling in lxml.html

Ian Bicking ianb at colorstudy.com
Tue Mar 4 20:27:17 CET 2008


Stefan Behnel wrote:
> Ian Bicking wrote:
>> translating HTML to XHTML is kind of an outstanding issue for lxml.html,
>> and it seems reasonable to me that XHTML could be parsed into the same
>> classes as HTML.  The only real caveat there is that XHTML uses different
>> (namespaced) tag names.
> 
> I agree that there is more we could do. For example, we could add "xhtml" as a
> serialisation method and do stuff internally to add a namespace declaration to
> the serialised "<html>" (iff there isn't a namespace declared already). I'm
> not sure if it would be an error if the tree contains non-HTML elements, I
> guess we could just leave that to the user.

I think one of the justifications for XHTML (what few there are ;) is 
that it can represent non-HTML elements reasonably elegantly.  But I 
don't think this is a problem.


>> If you remove the tag names, then the classes and
>> the lookup applies just fine.  (Presumably the lookup could be changed to
>> support XHTML fairly easily.)
> 
> I would say so, yes. There would also be issues with the XPath expressions in
> things like html.clean, I assume. It would definitely be a good thing if the
> whole machinery could handle namespace-free HTML and namespaced XHTML equally
> well.

This came up with Deliverance as well, as some people want to use XHTML.
Because of all the namespace/URI/prefix confusion, it seems quite 
awkward.  The most elegant solution, at least in that context, seems 
like using just HTML internally.  So if we get XHTML, we parse it as XML 
and remove the namespace from every element in the namespace 
http://www.w3.org/1999/xhtml.  Then when serializing to XHTML, we add 
that namespace to everything that doesn't have a namespace (maybe 
restricted to a whitelist of XHTML elements).  Then internally there's a 
consistent representation, and the XHTML/HTML division can be treated 
more like a parsing/serialization issue.
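
A rough sketch of that round trip, just to make the idea concrete. The two
helper names are made up for illustration and are not part of lxml; the
whitelist idea is left out, so every un-namespaced element gets the XHTML
namespace on the way out.

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"
_PREFIX = "{%s}" % XHTML_NS


def strip_xhtml_namespace(root):
    """Hypothetical helper: rename {XHTML_NS}tag to plain tag, in place."""
    for el in root.iter():
        # Comments and processing instructions have a non-string tag.
        if isinstance(el.tag, str) and el.tag.startswith(_PREFIX):
            el.tag = el.tag[len(_PREFIX):]
    # Drop the now-unused xmlns declaration left on the root.
    etree.cleanup_namespaces(root)
    return root


def add_xhtml_namespace(root):
    """Hypothetical helper: put every un-namespaced element into XHTML_NS."""
    for el in root.iter():
        if isinstance(el.tag, str) and not el.tag.startswith("{"):
            el.tag = _PREFIX + el.tag
    return root


doc = etree.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><div>hi</div></body></html>"
)
strip_xhtml_namespace(doc)   # internal, namespace-free form
add_xhtml_namespace(doc)     # back to namespaced form before serializing
```

Elements in other namespaces are left alone in both directions. Which prefix
ends up in the serialized output is not controlled by this sketch.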

Arguably the distinction is more than just serialization, and 
{http://www.w3.org/1999/xhtml}div is really distinct from a plain div. 
But that's not an argument I'd make ;)

Mostly as an aside, I'm planning to parse XHTML using the XML parser and, 
if that fails, to fall back to the HTML parser, since the error handling of 
the two parsers is so different that they aren't really equivalent. 
Or... put another way, if you consider the error-tolerant HTML parser to 
be suitable for a task, then the error-intolerant XML parser may not be 
suitable (by itself).
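
Something along these lines is what I have in mind; the function name is
hypothetical, and this is a minimal sketch that assumes the input is a byte
string and that any XML syntax error is reason enough to fall back.

```python
from lxml import etree
import lxml.html


def parse_xhtml_lenient(data):
    """Hypothetical helper: strict XML parse first, HTML parser on failure."""
    try:
        # Well-formed XHTML: elements keep their XHTML namespace.
        return etree.fromstring(data)
    except etree.XMLSyntaxError:
        # Not well-formed: let the error-tolerant HTML parser recover.
        return lxml.html.fromstring(data)


good = parse_xhtml_lenient(
    b'<html xmlns="http://www.w3.org/1999/xhtml"><body><p>ok</p></body></html>'
)
bad = parse_xhtml_lenient(b"<html><body><p>unclosed<br></body></html>")
```

Note that the two branches return different things: the XML branch gives
namespaced plain elements, while the HTML branch gives namespace-free
lxml.html elements, so the caller still has to normalize the result.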

   Ian

