[lxml-dev] XHTML handling in lxml.html
Ian Bicking
ianb at colorstudy.com
Tue Mar 4 20:27:17 CET 2008
Stefan Behnel wrote:
> Ian Bicking wrote:
>> translating HTML to XHTML is kind of an outstanding issue for lxml.html,
>> and it seems reasonable to me that XHTML could be parsed into the same
>> classes as HTML. The only real caveat there is that XHTML uses different
>> (namespaced) tag names.
>
> I agree that there is more we could do. For example, we could add "xhtml" as a
> serialisation method and do stuff internally to add a namespace declaration to
> the serialised "<html>" (iff there isn't a namespace declared already). I'm
> not sure if it would be an error if the tree contains non-HTML elements, I
> guess we could just leave that to the user.
I think one of the justifications for XHTML (what few there are ;) is
that it can represent non-HTML elements reasonably elegantly. But I
don't think this is a problem.
>> If you remove the tag names, then the classes and
>> the lookup applies just fine. (Presumably the lookup could be changed to
>> support XHTML fairly easily.)
>
> I would say so, yes. There would also be issues with the XPath expressions in
> things like html.clean, I assume. It would definitely be a good thing if the
> whole machinery could handle namespace-free HTML and namespaced XHTML equally
> well.
This came up with Deliverance as well, as some people want to use XHTML.
Because of all the namespace/URI/prefix confusion, it seems quite
awkward. The most elegant solution, at least in that context, seems
to be using plain HTML internally. So if we get XHTML, we parse it as
XML and strip the namespace from every element in the namespace
http://www.w3.org/1999/xhtml. Then when serializing to XHTML, we add
that namespace to every element that doesn't have one (perhaps
restricted to a whitelist of XHTML elements). Then internally there's a
consistent representation, and the XHTML/HTML division can be treated
more like a parsing/serialization issue.
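
A rough sketch of what I mean, not a proposal for the actual API. The
function names and the whitelist parameter are made up for
illustration; only lxml.etree itself is real here. Note that this only
retags elements, it doesn't try to control which prefix the serializer
ends up using for the namespace.

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"
XHTML_PREFIX = "{%s}" % XHTML_NS


def strip_xhtml_namespace(root):
    """Turn {xhtml}div into plain div; leave other namespaces alone."""
    for el in root.iter():
        # Comments and processing instructions have a non-string tag.
        if isinstance(el.tag, str) and el.tag.startswith(XHTML_PREFIX):
            el.tag = el.tag[len(XHTML_PREFIX):]
    # Drop the now-unused xmlns declaration so it isn't serialized again.
    etree.cleanup_namespaces(root)
    return root


def add_xhtml_namespace(root, whitelist=None):
    """Put un-namespaced elements into the XHTML namespace.

    whitelist (hypothetical): if given, only these tag names are moved.
    """
    for el in root.iter():
        if not isinstance(el.tag, str) or el.tag.startswith("{"):
            continue
        if whitelist is None or el.tag in whitelist:
            el.tag = XHTML_PREFIX + el.tag
    return root
```

Elements in other namespaces (SVG, MathML, whatever) pass through both
functions untouched, so the round trip only affects the XHTML part.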
Arguably the distinction is more than just serialization, and
{http://www.w3.org/1999/xhtml}div is really distinct from a plain div.
But that's not an argument I'd make ;)
Mostly as an aside, I'm planning to parse XHTML using the XML parser
and, if that fails, to fall back to the HTML parser. The error behavior
of the two parsers is so different that they aren't really equivalent.
Or... put another way, if you consider the error-tolerant HTML parser to
be suitable for a task, then the error-intolerant XML parser may not be
suitable (by itself).
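
Something like the following, as a minimal sketch. The function name is
made up; etree.fromstring, etree.XMLSyntaxError and
lxml.html.fromstring are the existing lxml calls. I'm assuming the
input is bytes, since a unicode string with an XML encoding declaration
is rejected by the XML parser for a different reason.

```python
from lxml import etree
import lxml.html


def parse_xhtml_lenient(data):
    """Try the strict XML parser first, then the forgiving HTML parser."""
    try:
        return etree.fromstring(data)
    except etree.XMLSyntaxError:
        # Not well-formed XML; let the HTML parser recover what it can.
        return lxml.html.fromstring(data)
```

The two branches return different kinds of trees (namespaced plain
elements from the XML parser, un-namespaced lxml.html elements from the
fallback), so the caller still has to normalize the result afterwards.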
Ian