[Lxml-checkins] r33487 - lxml/branch/lxml-1.1/doc

scoder at codespeak.net
Fri Oct 20 08:56:11 CEST 2006


Author: scoder
Date: Fri Oct 20 08:56:10 2006
New Revision: 33487

Modified:
   lxml/branch/lxml-1.1/doc/FAQ.txt
   lxml/branch/lxml-1.1/doc/api.txt
Log:
notes on HTML parsing

Modified: lxml/branch/lxml-1.1/doc/FAQ.txt
==============================================================================
--- lxml/branch/lxml-1.1/doc/FAQ.txt	(original)
+++ lxml/branch/lxml-1.1/doc/FAQ.txt	Fri Oct 20 08:56:10 2006
@@ -240,10 +240,14 @@
 does not.  However, if the unicode string declares an XML encoding internally
 (``<?xml encoding="..."?>``), parsing is bound to fail, as this encoding is
 most likely not the real encoding used in Python unicode.  The same is true
-for HTML unicode strings that contain charset meta tags.  Note that Python
-uses different encodings for unicode on different platforms, so even
-specifying the real internal unicode encoding is not portable between Python
-interpreters.  Don't do it.
+for HTML unicode strings that contain charset meta tags, although the problems
+may be more subtle here.  The libxml2 HTML parser may fail to parse the meta
+tags in broken HTML and silently ignore them, so even if parsing succeeds,
+later handling may still fail with character encoding errors.
+
+Note that Python uses different encodings for unicode on different platforms,
+so even specifying the real internal unicode encoding is not portable between
+Python interpreters.  Don't do it.
 
 Python unicode strings with XML data or HTML data that carry encoding
 information are broken.  lxml will not parse them.  You must provide parsable

Modified: lxml/branch/lxml-1.1/doc/api.txt
==============================================================================
--- lxml/branch/lxml-1.1/doc/api.txt	(original)
+++ lxml/branch/lxml-1.1/doc/api.txt	Fri Oct 20 08:56:10 2006
@@ -192,9 +192,8 @@
 
 HTML parsing is similarly simple.  The parsers have a ``recover`` keyword
 argument that the HTMLParser sets by default.  It lets libxml2 try its best to
-return something usable without raising an exception.  Note that this
-functionality depends entirely on libxml2.  You should use libxml2 version
-2.6.21 or newer to take advantage of this feature::
+return something usable without raising an exception.  You should use libxml2
+version 2.6.21 or newer to take advantage of this feature::
 
   >>> broken_html = "<html><head><title>test<body><h1>page title</h3>"
 
@@ -211,6 +210,14 @@
   >>> print etree.tostring(html)
   <html><head><title>test</title></head><body><h1>page title</h1></body></html>
 
+The support for parsing broken HTML depends entirely on libxml2's recovery
+algorithm.  It is *not* the fault of lxml if you find documents that are so
+heavily broken that the parser cannot handle them.  There is also no guarantee
+that the resulting tree will contain all data from the original document.  The
+parser may have to drop seriously broken parts when struggling to keep
+parsing.  Misplaced meta tags in particular can suffer from this, which may
+lead to encoding problems.
+
 The use of the libxml2 parsers makes some additional information available at
 the API level.  Currently, ElementTree objects can access the DOCTYPE
 information provided by a parsed document, as well as the XML version and the

