[lxml-dev] lxml.html, now with ignored namespaces!
Geoffrey Sneddon
foolistbar at googlemail.com
Sat Jul 4 11:13:48 CEST 2009
On 27 Jun 2009, at 07:23, Stefan Behnel wrote:
>> The output:
>> -----
>> <html xmlns="http://www.w3.org/TR/1999/REC-html-in-xml"
>> cs="http://something.com/cs" xml:lang="en"
>> lang="en"><head><title>Help!</title></head><body><p>My namespaces are
>> going to disappear!</p><p content="fruit">FRUIT</p></body></html>
>> -----
>
> That's because HTML parsers are not namespace aware. Namespaces are
> simply not defined for HTML. But if you get a difference on different
> systems, I'd still suspect the reason to be different libxml2
> versions. There's nothing lxml can do about this.
It should still be outputting an attribute named "cs:content" rather
than dropping the "cs:". As you say, there are no namespaces in HTML,
so the prefix has no meaning and the name should be kept whole.
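
To illustrate what "kept whole" means, here is a minimal sketch using
Python's standard-library html.parser (neither lxml nor html5lib). The
input markup is my guess at the OP's document, reconstructed from the
quoted output, and AttrCollector is a made-up helper name. A tokenizer
with no notion of namespaces treats the colon as an ordinary character
of the attribute name:

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Record (tag, attribute name, value) triples as the tokenizer reports them."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            self.seen.append((tag, name, value))


# Assumed input: a guess at the original markup, not taken from the thread.
doc = (
    '<html xmlns:cs="http://something.com/cs"><body>'
    '<p cs:content="fruit">FRUIT</p></body></html>'
)

collector = AttrCollector()
collector.feed(doc)
collector.close()

for tag, name, value in collector.seen:
    print(tag, name, value)
```

The attribute on the p element should be reported under the name
"cs:content", prefix included.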
My basic advice to the OP would be to use html5lib, which is far
slower but copes with this fine.
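
A rough sketch of that, assuming html5lib is installed and again using
my guess at the OP's input markup. It builds a plain
xml.etree.ElementTree tree; namespaceHTMLElements=False is only there
so the tag names come back unqualified ("p" rather than a name in the
XHTML namespace):

```python
import html5lib

# Assumed input: a guess at the original markup, not taken from the thread.
doc = (
    '<html xmlns:cs="http://something.com/cs"><head><title>Help!</title></head>'
    '<body><p cs:content="fruit">FRUIT</p></body></html>'
)

# Parse into a standard-library ElementTree; the return value is the root element.
root = html5lib.parse(doc, treebuilder="etree", namespaceHTMLElements=False)

# Collect the attributes of every <p>; the prefixed name is stored verbatim.
p_attrs = [dict(p.attrib) for p in root.iter("p")]
print(p_attrs)
```

The attribute dictionary for the p element should have "cs:content" as
its key, not "content".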
--
Geoffrey Sneddon
<http://gsnedders.com/>