[lxml-dev] lxml.html, now with ignored namespaces!

Stefan Behnel stefan_ml at behnel.de
Sat Jun 27 07:23:10 CEST 2009


Hi,

Sorry, I hadn't actually read as far as your example.

Thomas Weigel wrote:
> I am using lxml to parse HTML documents, which include a custom 
> namespace (for example, "<p cs:content='fruit'>FRUIT</p>").
> 
> Notes:
> 
> 1. The XML parser will not work. Some documents will have legal HTML 
> that breaks an XML parser, like "<br>".
> 
> 2. Here is the sample code:
> 
> -----
>  >>> import lxml.html as parser
>  >>> document = parser.fromstring("""<!DOCTYPE html PUBLIC "-//W3C//DTD 
> XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html 
> xmlns="http://www.w3.org/TR/1999/REC-html-in-xml" 
> xmlns:cs="http://something.com/cs" xml:lang="en" 
> lang="en"><head><title>Help!</title></head><body><p>My namespaces are 
> going to disappear!</p><p cs:content='fruit'>FRUIT</p></body></html>""")
>  >>> print parser.tostring(document)
> -----

That's an XHTML document, for which the XML parser would be the right tool.
If you have XHTML documents that contain unterminated <br> tags, they are
not well-formed, and thus simply not XML, i.e. not XHTML. But you could try
creating a custom XMLParser with the "recover" option, which makes the
parser try to continue despite errors. As usual when parsing broken
documents, there's no guarantee that it won't drop some of the data it
failed to parse.

Obviously, the best way to deal with this kind of problem is fixing the
input documents.
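
A minimal sketch of the "recover" approach, assuming a cut-down version of
your document with an unterminated <br> added to make it ill-formed. The
names BROKEN_XHTML and cs_content_values are made up for illustration, and
the exact tree the parser recovers can differ between libxml2 versions, so
take this as a starting point rather than a guarantee:

```python
from lxml import etree

CS_NS = "http://something.com/cs"

# Hypothetical input: not well-formed XML because <br> is never closed.
BROKEN_XHTML = (
    '<html xmlns="http://www.w3.org/TR/1999/REC-html-in-xml" '
    'xmlns:cs="http://something.com/cs">'
    "<head><title>Help!</title></head>"
    "<body><p>line one<br>line two</p>"
    "<p cs:content='fruit'>FRUIT</p></body></html>"
)


def cs_content_values(markup, recover=False):
    """Parse markup as XML and return all cs:content attribute values."""
    parser = etree.XMLParser(recover=recover)
    root = etree.fromstring(markup, parser)
    # Namespaced attributes are addressed as "{namespace-uri}localname".
    key = "{%s}content" % CS_NS
    return [
        el.get(key)
        for el in root.iter(etree.Element)  # elements only, no comments/PIs
        if el.get(key) is not None
    ]


if __name__ == "__main__":
    try:
        cs_content_values(BROKEN_XHTML)
    except etree.XMLSyntaxError as exc:
        print("strict parser failed:", exc)
    print(cs_content_values(BROKEN_XHTML, recover=True))
```

The strict parser rejects the document, while the recovering parser keeps
going and the "cs" prefix stays bound to its namespace. Where the recovered
elements end up in the tree is up to libxml2, so check the result before
relying on it.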


> The output:
> -----
> <html xmlns="http://www.w3.org/TR/1999/REC-html-in-xml" 
> cs="http://something.com/cs" xml:lang="en" 
> lang="en"><head><title>Help!</title></head><body><p>My namespaces are 
> going to disappear!</p><p content="fruit">FRUIT</p></body></html>
> -----

That's because HTML parsers are not namespace aware. Namespaces are simply
not defined for HTML. But if you get different results on different
systems, I'd still suspect the reason to be different libxml2 versions.
There's nothing lxml can do about this.
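
To compare the systems, something like the following should print the
versions involved. The helper name library_versions is made up here; the
version attributes are the ones lxml.etree exposes:

```python
from lxml import etree


def library_versions():
    """Return the lxml and libxml2 versions as tuples of integers."""
    return {
        "lxml": etree.LXML_VERSION,
        "libxml2 (in use)": etree.LIBXML_VERSION,
        "libxml2 (compiled against)": etree.LIBXML_COMPILED_VERSION,
    }


if __name__ == "__main__":
    for name, version in library_versions().items():
        print(name, ".".join(str(part) for part in version))
```

If the libxml2 numbers differ between the two machines, that is the first
thing I would look at.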

Stefan

