[lxml-dev] lxml.html, now with ignored namespaces!

Thomas Weigel seasong at chantofwaves.com
Sun Jun 28 23:09:22 CEST 2009


Stefan Behnel wrote:
> That's an XHTML document, for which the XML parser would be the right tool.

Sadly, not every page will be an XHTML document. Nor will every page be 
created by someone like me, an individual who loves XHTML and 
strictness. I apologize for giving the impression that my users might be 
sane and decent.

> If you have XHTML documents that contain unterminated <br> tags, they are
> not well-formed, and thus simply not XML, i.e. not XHTML.

I will have HTML 4 Loose and HTML5 documents that contain unterminated 
<br> tags, among others.

> Obviously, the best way to deal with this kind of problem is fixing the
> input documents.

Sadly, not possible. I mean, it would be nice. It surely would. But no.

>> -----
>> <html xmlns="http://www.w3.org/TR/1999/REC-html-in-xml" 
>> cs="http://something.com/cs" xml:lang="en" 
>> lang="en"><head><title>Help!</title></head><body><p>My namespaces are 
>> going to disappear!</p><p content="fruit">FRUIT</p></body></html>
>> -----
> 
> That's because HTML parsers are not namespace aware. Namespaces are simply
> not defined for HTML. But if you get a difference on different systems, I'd
> still suspect the reason to be different libxml2 versions. There's nothing
> lxml can do about this.

Yes, I gathered that from the previous reply. There's not much I can do 
about it, either, since I won't be in control of the specific libxml2 
installation.

Currently, I have a small unit test built in that checks whether the 
parser strips namespace prefixes. If it does, I replace all 
"cs:something" attributes with "cs_something" attributes.

It's far from ideal, but it at least works.
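
For anyone curious, here is a rough sketch of that workaround, not my 
exact code. The function names and the "cs:probe" attribute are made up 
for illustration. It assumes the rewrite happens on the raw source 
before parsing, and the regex is deliberately naive: it only looks 
inside tags, and it would also touch a " cs:name=" sequence sitting 
inside an attribute value.

```python
import re

# Matches a prefixed attribute name such as ` cs:content` followed by `=`.
_PREFIXED_ATTR = re.compile(r'(\s)cs:([A-Za-z_][\w.-]*)(?=\s*=)')
# Naive tag matcher; assumes attribute values contain no '<' or '>'.
_TAG = re.compile(r'<[^<>]+>')


def parser_drops_prefixes():
    """Return True if the installed lxml/libxml2 strips the `cs:` prefix.

    Hypothetical probe: parse a tiny fragment and check whether the
    prefixed attribute name survived.
    """
    # Imported here so the rewrite helper below works without lxml.
    from lxml import html
    probe = html.fromstring('<p cs:probe="1">x</p>')
    return 'cs:probe' not in probe.keys()


def flatten_prefixed_attrs(source):
    """Rewrite cs:name="..." to cs_name="..." inside tags only."""
    def fix_tag(match):
        return _PREFIXED_ATTR.sub(r'\1cs_\2', match.group(0))
    return _TAG.sub(fix_tag, source)
```

The idea would be to call flatten_prefixed_attrs() on the page source 
only when parser_drops_prefixes() returns True, and to look up 
"cs_something" instead of "cs:something" in that case.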

Again, thank you.


Thomas Weigel

