[lxml-dev] lxml.html, now with ignored namespaces!
Thomas Weigel
seasong at chantofwaves.com
Tue Jun 23 02:53:34 CEST 2009
I am using lxml to parse HTML documents that use a custom namespace
prefix on some attributes (for example, "<p cs:content='fruit'>FRUIT</p>").
In lxml 2.2.0, on Windows, this worked just fine, and elements could be
processed based on this data.
In lxml 2.2.2, on Linux, this fails. As soon as the document is parsed
by lxml.html (or lxml.etree.HTMLParser()), the above example becomes
"<p content='fruit'>FRUIT</p>".
I don't know if this is caused by the switch to Linux, or the upgrade to
2.2.2. I don't have control over the installation, so I can't switch to
2.2.2 under Windows, or 2.2.0 under Linux to check.
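
One way to narrow this down without reinstalling anything might be to
compare version details on both machines. lxml does its HTML parsing
through libxml2, so a difference in the libxml2 build is a possible
cause alongside the lxml version and the operating system. The sketch
below is only a diagnostic aid: the helper name describe_environment is
made up for illustration, while the version constants are attributes of
lxml.etree.

```python
import sys


def describe_environment():
    """Return platform and lxml/libxml2 versions, or None if lxml is missing.

    describe_environment is a hypothetical helper name, not part of lxml.
    """
    try:
        from lxml import etree
    except ImportError:
        return None
    return {
        "platform": sys.platform,
        "lxml": etree.LXML_VERSION,
        "libxml2 (runtime)": etree.LIBXML_VERSION,
        "libxml2 (compiled against)": etree.LIBXML_COMPILED_VERSION,
    }


info = describe_environment()
if info is None:
    print("lxml is not installed")
else:
    # Print one "name: version" line per entry, in a stable order.
    for key, value in sorted(info.items()):
        print("%s: %s" % (key, value))
```

Running this on both the Windows and the Linux installation and comparing
the lines should show which of the three factors actually differ.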
I did find this reference (the only one I could find) to the HTML
parser ignoring namespaces:
http://codespeak.net/lxml/lxmlhtml.html#running-html-doctests
...however, it wasn't doing that before, and it seems odd that this is
only mentioned in the doctests section.
Is there a way to work around this? Are custom namespaces simply not
possible in lxml's HTML?
Notes:
1. The XML parser will not work. Some documents will have legal HTML
that breaks an XML parser, like "<br>".
2. Here is the sample code:
-----
>>> import lxml.html as parser
>>> document = parser.fromstring("""<!DOCTYPE html PUBLIC "-//W3C//DTD
XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html
xmlns="http://www.w3.org/TR/1999/REC-html-in-xml"
xmlns:cs="http://something.com/cs" xml:lang="en"
lang="en"><head><title>Help!</title></head><body><p>My namespaces are
going to disappear!</p><p cs:content='fruit'>FRUIT</p></body></html>""")
>>> print parser.tostring(document)
-----
The output:
-----
<html xmlns="http://www.w3.org/TR/1999/REC-html-in-xml"
cs="http://something.com/cs" xml:lang="en"
lang="en"><head><title>Help!</title></head><body><p>My namespaces are
going to disappear!</p><p content="fruit">FRUIT</p></body></html>
-----
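
To make the loss easy to check on any installation, the input and the
serialized output can be compared mechanically. The following is a rough
sketch that assumes Python 3 and uses only the standard library's
html.parser rather than lxml; the names PrefixedAttrCollector,
prefixed_attrs and lost_prefixed_attrs are invented for this
illustration.

```python
from html.parser import HTMLParser


class PrefixedAttrCollector(HTMLParser):
    """Collect attribute names that contain a colon (e.g. 'cs:content')."""

    def __init__(self):
        super().__init__()
        self.prefixed = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        for name, _value in attrs:
            if ":" in name:
                self.prefixed.add(name)


def prefixed_attrs(markup):
    """Return the set of colon-containing attribute names in markup."""
    collector = PrefixedAttrCollector()
    collector.feed(markup)
    collector.close()
    return collector.prefixed


def lost_prefixed_attrs(before, after):
    """Return prefixed attribute names present in before but not in after."""
    return prefixed_attrs(before) - prefixed_attrs(after)


before = "<p cs:content='fruit'>FRUIT</p>"
after = '<p content="fruit">FRUIT</p>'
print(sorted(lost_prefixed_attrs(before, after)))  # prints ['cs:content']
```

Here before stands for the markup handed to the parser and after for the
string returned by tostring(); an empty result would mean no prefixed
attribute name was dropped.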
Thomas Weigel