[lxml-dev] lxml.html, now with ignored namespaces!
Stefan Behnel
stefan_ml at behnel.de
Mon Jun 29 08:36:01 CEST 2009
Hi,
Thomas Weigel wrote:
> Stefan Behnel wrote:
>> That's an XHTML document, for which the XML parser would be the right tool.
>
> Sadly, not every page will be an XHTML document. Nor will every page be
> created by someone like me, an individual who loves XHTML and
> strictness. I apologize for giving the impression that my users might be
> sane and decent.
>
>> If you have XHTML documents that contain unterminated <br> tags, they are
>> not well-formed, and thus simply not XML, i.e. not XHTML.
>
> I will have HTML 4 Loose and HTML5 documents that contain unterminated
> <br> tags, among others.
Well, that's not XHTML then, and the two aren't that hard to distinguish
even before parsing. What about running the XML parser on the document
first and falling back to the HTML parser only if the XML parser fails?
Parsing should be fast enough that paying for it twice is worth the
convenience you gain.
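
A minimal sketch of that fallback, assuming the input is a byte string and
that lxml is installed. The function name parse_markup and the two sample
documents are made up for illustration; they are not part of lxml.

```python
from lxml import etree
import lxml.html


def parse_markup(data: bytes):
    """Return the root element, trying the XML parser first, then HTML."""
    try:
        # Strict parse: succeeds only for well-formed XML such as XHTML.
        return etree.fromstring(data)
    except etree.XMLSyntaxError:
        # Not well-formed (e.g. an unterminated <br>), so hand the same
        # bytes to the lenient HTML parser instead.
        return lxml.html.document_fromstring(data)


# Hypothetical sample inputs: one XHTML document, one loose HTML document.
xhtml = b'<html xmlns="http://www.w3.org/1999/xhtml"><body><p>a<br/>b</p></body></html>'
html4 = b'<html><body><p>a<br>b</p></body></html>'

print(parse_markup(xhtml).tag)  # namespaced tag from the XML parser
print(parse_markup(html4).tag)  # plain tag from the HTML parser
```

Note that the two paths give differently shaped trees: the XML parser keeps
the XHTML namespace on every tag, while the HTML parser returns plain tag
names, so code that walks the result has to cope with both.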
If you parse from a (byte?) string, you could also just check if the XHTML
namespace appears in the input data or if the data starts with an XML
declaration ("<?xml..."), and use the XML parser only for those.
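
That pre-check could look roughly like the following. It is only a sketch:
the helper name looks_like_xhtml is invented here, and it assumes the data
is a byte string in an ASCII-compatible encoding such as UTF-8 or Latin-1
(UTF-16 input would not be recognised).

```python
XHTML_NS = b"http://www.w3.org/1999/xhtml"
UTF8_BOM = b"\xef\xbb\xbf"


def looks_like_xhtml(data: bytes) -> bool:
    """Cheap heuristic for choosing the XML parser before any parsing."""
    # Drop a UTF-8 byte order mark so it does not hide the declaration.
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]
    # An XML declaration must be the very first thing in the document.
    if data.startswith(b"<?xml"):
        return True
    # Otherwise look for the XHTML namespace anywhere in the input.
    return XHTML_NS in data


print(looks_like_xhtml(b'<?xml version="1.0"?><html/>'))
print(looks_like_xhtml(b'<html><body><br></body></html>'))
```

A True result would send the document to the XML parser and a False result
to the HTML parser. Since this is a heuristic, a page that merely mentions
the namespace URI in its text would also match, so combining it with the
fallback on a parse error is the safer choice.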
Stefan