[lxml-dev] lxml.html, now with ignored namespaces!

Geoffrey Sneddon foolistbar at googlemail.com
Sat Jul 4 11:13:48 CEST 2009


On 27 Jun 2009, at 07:23, Stefan Behnel wrote:

>> The output:
>> -----
>> <html xmlns="http://www.w3.org/TR/1999/REC-html-in-xml"
>> cs="http://something.com/cs" xml:lang="en"
>> lang="en"><head><title>Help!</title></head><body><p>My namespaces are
>> going to disappear!</p><p content="fruit">FRUIT</p></body></html>
>> -----
>
> That's because HTML parsers are not namespace aware. Namespaces are simply
> not defined for HTML. But if you get a difference on different systems, I'd
> still suspect the reason to be different libxml2 versions. There's nothing
> lxml can do about this.

It should still be outputting an attribute named "cs:content"; it
shouldn't be dropping the "cs:". As you say, there are no namespaces in
HTML, so the prefix has no special meaning and is just part of the name.
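
To illustrate what I mean, here is a minimal sketch using Python's
standard-library html.parser (deliberately neither lxml nor html5lib, so
it says nothing about what libxml2 does). The input markup is my guess at
the OP's source, reconstructed from the output quoted above, and the
AttrCollector class is a made-up helper for the example. A parser that
knows nothing about namespaces simply reports "cs:content" as an ordinary
name:

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: record (tag, attribute name, value) triples."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are not split on ":"
        for name, value in attrs:
            self.seen.append((tag, name, value))


# Assumed input: a guess at the OP's markup, not taken from the thread.
markup = '<p cs:content="fruit">FRUIT</p>'

collector = AttrCollector()
collector.feed(markup)
collector.close()

print(collector.seen)  # [('p', 'cs:content', 'fruit')]
```

The colon is kept as part of the name, which is the behaviour I would
expect from an HTML parser here.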

My basic advice to the OP would be to use html5lib, which is far  
slower, but does cope with this fine.


--
Geoffrey Sneddon
<http://gsnedders.com/>


