[Lxml-checkins] r33487 - lxml/branch/lxml-1.1/doc
scoder at codespeak.net
Fri Oct 20 08:56:11 CEST 2006
Author: scoder
Date: Fri Oct 20 08:56:10 2006
New Revision: 33487
Modified:
lxml/branch/lxml-1.1/doc/FAQ.txt
lxml/branch/lxml-1.1/doc/api.txt
Log:
notes on HTML parsing
Modified: lxml/branch/lxml-1.1/doc/FAQ.txt
==============================================================================
--- lxml/branch/lxml-1.1/doc/FAQ.txt (original)
+++ lxml/branch/lxml-1.1/doc/FAQ.txt Fri Oct 20 08:56:10 2006
@@ -240,10 +240,14 @@
does not. However, if the unicode string declares an XML encoding internally
(``<?xml encoding="..."?>``), parsing is bound to fail, as this encoding is
most likely not the real encoding used in Python unicode. The same is true
-for HTML unicode strings that contain charset meta tags. Note that Python
-uses different encodings for unicode on different platforms, so even
-specifying the real internal unicode encoding is not portable between Python
-interpreters. Don't do it.
+for HTML unicode strings that contain charset meta tags, although the problems
+may be more subtle here. The libxml2 HTML parser may not be able to parse the
+meta tags in broken HTML and simply ignore them, so even if parsing succeeds,
+later handling may still fail with character encoding errors.
+
+Note that Python uses different encodings for unicode on different platforms,
+so even specifying the real internal unicode encoding is not portable between
+Python interpreters. Don't do it.
Python unicode strings with XML data or HTML data that carry encoding
information are broken. lxml will not parse them. You must provide parsable
Modified: lxml/branch/lxml-1.1/doc/api.txt
==============================================================================
--- lxml/branch/lxml-1.1/doc/api.txt (original)
+++ lxml/branch/lxml-1.1/doc/api.txt Fri Oct 20 08:56:10 2006
@@ -192,9 +192,8 @@
HTML parsing is similarly simple. The parsers have a ``recover`` keyword
argument that the HTMLParser sets by default. It lets libxml2 try its best to
-return something usable without raising an exception. Note that this
-functionality depends entirely on libxml2. You should use libxml2 version
-2.6.21 or newer to take advantage of this feature::
+return something usable without raising an exception. You should use libxml2
+version 2.6.21 or newer to take advantage of this feature::
>>> broken_html = "<html><head><title>test<body><h1>page title</h3>"
@@ -211,6 +210,14 @@
>>> print etree.tostring(html)
<html><head><title>test</title></head><body><h1>page title</h1></body></html>
+The support for parsing broken HTML depends entirely on libxml2's recovery
+algorithm. It is *not* the fault of lxml if you find documents that are so
+heavily broken that the parser cannot handle them. There is also no guarantee
+that the resulting tree will contain all data from the original document. The
+parser may have to drop seriously broken parts when struggling to keep
+parsing. Especially misplaced meta tags can suffer from this, which may lead
+to encoding problems.
+
The use of the libxml2 parsers makes some additional information available at
the API level. Currently, ElementTree objects can access the DOCTYPE
information provided by a parsed document, as well as the XML version and the
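
Note on the FAQ change above: it advises against handing the parser a unicode string that still carries an encoding declaration. One possible way to follow that advice is to remove the declaration before parsing. The sketch below is not part of this commit and uses only the standard library. The helper name ``strip_xml_encoding`` is hypothetical, and the code assumes Python 3 string semantics. It only handles the XML declaration; HTML charset meta tags are not covered.

```python
import re

# Matches the encoding pseudo-attribute inside a leading XML declaration.
# "[^>]*?" cannot run past the closing "?>", so only the declaration is touched.
_XML_DECL_ENCODING = re.compile(
    r"""^(<\?xml[^>]*?)\s+encoding\s*=\s*(["'])[A-Za-z][A-Za-z0-9._-]*\2"""
)


def strip_xml_encoding(text):
    """Return ``text`` without the encoding declaration (hypothetical helper).

    The rest of the XML declaration (version, standalone) is kept.
    Strings without an XML declaration are returned unchanged.
    """
    return _XML_DECL_ENCODING.sub(r"\1", text, count=1)


if __name__ == "__main__":
    declared = '<?xml version="1.0" encoding="iso-8859-1"?><a>\xe9</a>'
    # The declaration no longer claims an encoding that the unicode
    # string does not actually use.
    print(strip_xml_encoding(declared))
```

The alternative is to skip unicode strings altogether and pass the original byte string to the parser, so that the declared encoding and the actual data agree.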