[lxml-dev] lxml.html, now with ignored namespaces!

Thomas Weigel seasong at chantofwaves.com
Tue Jun 23 02:53:34 CEST 2009


I am using lxml to parse HTML documents that include attributes in a 
custom namespace (for example, "<p cs:content='fruit'>FRUIT</p>").

In lxml 2.2.0, on Windows, this worked just fine, and elements could be 
processed based on this data.

In lxml 2.2.2, on Linux, this fails. The above example becomes "<p 
content='fruit'>FRUIT</p>" as soon as it is parsed by lxml.html (or 
lxml.etree.HTMLParser()).

I don't know whether this is caused by the switch to Linux or by the 
upgrade to 2.2.2. I don't control either installation, so I can't test 
2.2.2 under Windows or 2.2.0 under Linux to check.
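
For comparing the two machines, here is a minimal sketch that reports 
the versions in play. It assumes the difference might come from the 
underlying libxml2 rather than from lxml itself, which is only a guess 
on my part. `lxml_versions` is a hypothetical helper name; the three 
version constants are the ones lxml.etree exposes.

```python
def lxml_versions():
    """Return the lxml and libxml2 versions, or None if lxml is missing."""
    try:
        from lxml import etree
    except ImportError:
        # lxml is not installed on this machine, nothing to report.
        return None
    return {
        "lxml": etree.LXML_VERSION,
        "libxml2 (runtime)": etree.LIBXML_VERSION,
        "libxml2 (compiled against)": etree.LIBXML_COMPILED_VERSION,
    }


versions = lxml_versions()
if versions is None:
    print("lxml is not installed")
else:
    for name, version in versions.items():
        print(name, version)
```

Running this on both the Windows and the Linux installation would show 
whether more than the lxml version differs between them.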

I did find one reference (the only one I could find) to the HTML parser 
ignoring namespaces:
http://codespeak.net/lxml/lxmlhtml.html#running-html-doctests

However, the parser was not dropping the prefix before, and it seems odd 
that this behaviour is only mentioned in the doctests section.

Is there a way to work around this? Are custom namespaces simply not 
possible with lxml's HTML parser?
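
One fallback I can sketch, under stated assumptions: Python's standard 
library `html.parser` (not lxml) reports attribute names as written, 
lowercased, so the "cs:" prefix is still visible. This only locates the 
prefixed attributes; it does not build a tree, so it is not a 
replacement for lxml.html. `PrefixedAttributeCollector` is a 
hypothetical name used for illustration.

```python
from html.parser import HTMLParser


class PrefixedAttributeCollector(HTMLParser):
    """Collect (tag, attribute, value) triples for one attribute prefix."""

    def __init__(self, prefix):
        super().__init__()
        # html.parser lowercases attribute names, so compare in lowercase.
        self.prefix = prefix.lower() + ":"
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name.startswith(self.prefix):
                self.found.append((tag, name, value))


collector = PrefixedAttributeCollector("cs")
# Non-XML markup such as a bare <br> is accepted by this parser.
collector.feed("<body><p>plain<br></p><p cs:content='fruit'>FRUIT</p></body>")
collector.close()
print(collector.found)  # [('p', 'cs:content', 'fruit')]
```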

Notes:

1. The XML parser will not work. Some documents will have legal HTML 
that breaks an XML parser, like "<br>".

2. Here is the sample code:

-----
>>> import lxml.html as parser
>>> document = parser.fromstring(
...     '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" '
...     '"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">'
...     '<html xmlns="http://www.w3.org/TR/1999/REC-html-in-xml" '
...     'xmlns:cs="http://something.com/cs" xml:lang="en" lang="en">'
...     '<head><title>Help!</title></head>'
...     '<body><p>My namespaces are going to disappear!</p>'
...     "<p cs:content='fruit'>FRUIT</p></body></html>")
>>> print parser.tostring(document)
-----

The output:
-----
<html xmlns="http://www.w3.org/TR/1999/REC-html-in-xml" 
cs="http://something.com/cs" xml:lang="en" 
lang="en"><head><title>Help!</title></head><body><p>My namespaces are 
going to disappear!</p><p content="fruit">FRUIT</p></body></html>
-----
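
To illustrate note 1, a minimal sketch of why an XML parser is not an 
option for these documents. It uses the standard library's 
xml.etree.ElementTree instead of lxml.etree, as a stand-in; the 
well-formedness rule it trips over is the same. `parses_as_xml` is a 
hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup):
    """Return True if the markup is well-formed XML, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# A bare <br> is legal HTML but leaves an unclosed element in XML.
print(parses_as_xml("<p>one<br>two</p>"))   # False
# The self-closing form is well-formed XML.
print(parses_as_xml("<p>one<br/>two</p>"))  # True
```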


Thomas Weigel


