[Lxml-checkins] r53742 - in lxml/trunk: . doc

scoder at codespeak.net scoder at codespeak.net
Sun Apr 13 18:30:08 CEST 2008


Author: scoder
Date: Sun Apr 13 18:30:06 2008
New Revision: 53742

Modified:
   lxml/trunk/   (props changed)
   lxml/trunk/doc/elementsoup.txt
Log:
 r3953 at delle:  sbehnel | 2008-04-13 18:28:47 +0200
 doc update, section on using lxml.html.soupparser as a fallback


Modified: lxml/trunk/doc/elementsoup.txt
==============================================================================
--- lxml/trunk/doc/elementsoup.txt	(original)
+++ lxml/trunk/doc/elementsoup.txt	Sun Apr 13 18:30:06 2008
@@ -3,7 +3,7 @@
 ====================
 
 BeautifulSoup_ is a Python package that parses broken HTML.  While libxml2
-(and thus lxml) can also parse broken HTML, BeautifulSoup is somewhat more
+(and thus lxml) can also parse broken HTML, BeautifulSoup is a bit more
 forgiving and has superior `support for encoding detection`_.
 
 .. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
@@ -13,7 +13,7 @@
 lxml can benefit from the parsing capabilities of BeautifulSoup
 through the ``lxml.html.soupparser`` module.  It provides three main
 functions: ``fromstring()`` and ``parse()`` to parse a string or file
-using BeautifulSoup, and `convert_tree()` to convert an existing
+using BeautifulSoup, and ``convert_tree()`` to convert an existing
 BeautifulSoup tree into a list of top-level Elements.
 
 The functions ``fromstring()`` and ``parse()`` behave as known from
@@ -58,6 +58,10 @@
 ``makeelement`` factory function to ``parse()`` and ``fromstring()``.
 By default, this is based on the HTML parser defined in ``lxml.html``.
 
+
+Entity handling
+===============
+
 By default, the BeautifulSoup parser also replaces the entities it
 finds by their character equivalent.
 
@@ -94,3 +98,45 @@
 
     >>> tostring(body, method="html", encoding=unicode)
     u'<body>\xa9\u20ac-\xf5\u01bd<p></p></body>'
+
+
+Using soupparser as a fallback
+==============================
+
+The downside of using this parser is that it is `much slower`_ than
+the HTML parser of lxml.  So if performance matters, you might want to
+consider using ``soupparser`` only as a fallback for certain cases.
+
+.. _`much slower`: http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
+
+A common problem with lxml's parser is that it might not get the
+encoding right in cases where the document contains a ``<meta>`` tag
+in the wrong place.  In this case, you can exploit the fact that lxml
+serialises much faster than most other HTML libraries for Python.
+Just serialise the document to unicode, and if that raises an
+exception, re-parse it with BeautifulSoup to see if that works
+better.
+
+.. sourcecode:: pycon
+
+    >>> tag_soup = '''\
+    ... <meta http-equiv="Content-Type"
+    ...       content="text/html;charset=utf-8" />
+    ... <html>
+    ...   <head>
+    ...     <title>Hello W\xc3\xb6rld!</title>
+    ...   </head>
+    ...   <body>Hi all</body>
+    ... </html>'''
+
+    >>> import lxml.html
+    >>> import lxml.html.soupparser
+
+    >>> root = lxml.html.fromstring(tag_soup)
+    >>> try:
+    ...     ignore = tostring(root, encoding=unicode)
+    ... except UnicodeDecodeError:
+    ...     root = lxml.html.soupparser.fromstring(tag_soup)
+    ...     # try again, but don't catch the exception this time
+    ...     ignore = tostring(root, encoding=unicode)
+
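
The fallback pattern added in this revision can also be wrapped in a small reusable helper. The following is only a sketch, not part of the commit: the function name ``parse_with_fallback`` and its parameters ``fast_parse``, ``fallback_parse`` and ``serialise`` are hypothetical, and the parsers are passed in as callables so that the helper itself does not depend on any particular library.

```python
def parse_with_fallback(data, fast_parse, fallback_parse, serialise):
    """Parse ``data`` with ``fast_parse``; if serialising the result raises
    UnicodeDecodeError, re-parse with ``fallback_parse``.

    All three callables are supplied by the caller (hypothetical names,
    not an lxml API).
    """
    root = fast_parse(data)
    try:
        # The serialised result is discarded; only success or failure matters.
        serialise(root)
    except UnicodeDecodeError:
        root = fallback_parse(data)
        # Try again, but let any exception propagate this time.
        serialise(root)
    return root
```

With the names used in the doctest above, one would presumably pass ``lxml.html.fromstring`` as ``fast_parse``, ``lxml.html.soupparser.fromstring`` as ``fallback_parse``, and a small wrapper around ``tostring(root, encoding=unicode)`` as ``serialise``.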

