[lxml-dev] lxml.html, now with ignored namespaces!
Thomas Weigel
seasong at chantofwaves.com
Sun Jun 28 23:09:22 CEST 2009
Stefan Behnel wrote:
> That's an XHTML document, for which the XML parser would be the right tool.
Sadly, not every page will be an XHTML document. Nor will every page be
created by someone like me, an individual who loves XHTML and
strictness. I apologize for giving the impression that my users might be
sane and decent.
> If you have XHTML documents that contain unterminated <br> tags, they are
> not well-formed, and thus simply not XML, i.e. not XHTML.
I will have HTML 4 Loose and HTML5 documents that contain unterminated
<br> tags, among others.
> Obviously, the best way to deal with this kind of problem is fixing the
> input documents.
Sadly, not possible. I mean, it would be nice. It surely would. But no.
>> -----
>> <html xmlns="http://www.w3.org/TR/1999/REC-html-in-xml"
>> cs="http://something.com/cs" xml:lang="en"
>> lang="en"><head><title>Help!</title></head><body><p>My namespaces are
>> going to disappear!</p><p content="fruit">FRUIT</p></body></html>
>> -----
>
> That's because HTML parsers are not namespace aware. Namespaces are simply
> not defined for HTML. But if you get a difference on different systems, I'd
> still suspect the reason to be different libxml2 versions. There's nothing
> lxml can do about this.
Yes, I gathered that from the previous reply. There's not much I can do
about it, either, since I won't be in control of the specific libxml2
installation.
Currently, I have a small unit test built in that checks whether the
parser strips namespace prefixes. If it does, I replace all
"cs:something" attributes with "cs_something" attributes. It's far from
ideal, but it at least works.
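
For the archives, a minimal sketch of that workaround, not the exact code in use. The function names and the probe markup are made up for illustration, and the regex is deliberately naive: it rewrites anything that looks like a prefixed attribute, including inside text or comments. It assumes the parser in question is the one behind lxml.html.fromstring().

```python
import re


def parser_keeps_prefix(prefix="cs"):
    """Return True if lxml's HTML parser keeps 'prefix:name' attribute names.

    Hypothetical helper: which result you get depends on the libxml2
    version that lxml was built against.
    """
    # Imported here so the rewrite below stays usable without lxml.
    from lxml import html

    probe = html.fromstring('<p %s:probe="1">x</p>' % prefix)
    return ("%s:probe" % prefix) in probe.attrib


def rewrite_prefixed_attrs(markup, prefix="cs"):
    """Rename 'prefix:name=' to 'prefix_name=' in raw markup (naive regex)."""
    pattern = re.compile(
        r'(\s)%s:([A-Za-z_][\w.-]*)(?=\s*=)' % re.escape(prefix)
    )
    return pattern.sub(r'\g<1>%s_\g<2>' % prefix, markup)
```

The rewrite would only be applied to the input when parser_keeps_prefix() returns False, and it has to run before the markup is handed to the HTML parser.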
Again, thank you.
Thomas Weigel