[Lxml-checkins] r53742 - in lxml/trunk: . doc
scoder at codespeak.net
Sun Apr 13 18:30:08 CEST 2008
Author: scoder
Date: Sun Apr 13 18:30:06 2008
New Revision: 53742
Modified:
lxml/trunk/ (props changed)
lxml/trunk/doc/elementsoup.txt
Log:
r3953 at delle: sbehnel | 2008-04-13 18:28:47 +0200
doc update, section on using lxml.html.soupparser as a fallback
Modified: lxml/trunk/doc/elementsoup.txt
==============================================================================
--- lxml/trunk/doc/elementsoup.txt (original)
+++ lxml/trunk/doc/elementsoup.txt Sun Apr 13 18:30:06 2008
@@ -3,7 +3,7 @@
====================
BeautifulSoup_ is a Python package that parses broken HTML. While libxml2
-(and thus lxml) can also parse broken HTML, BeautifulSoup is somewhat more
+(and thus lxml) can also parse broken HTML, BeautifulSoup is a bit more
forgiving and has superior `support for encoding detection`_.
.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
@@ -13,7 +13,7 @@
lxml can benefit from the parsing capabilities of BeautifulSoup
through the ``lxml.html.soupparser`` module. It provides three main
functions: ``fromstring()`` and ``parse()`` to parse a string or file
-using BeautifulSoup, and `convert_tree()` to convert an existing
+using BeautifulSoup, and ``convert_tree()`` to convert an existing
BeautifulSoup tree into a list of top-level Elements.
The functions ``fromstring()`` and ``parse()`` behave as known from
@@ -58,6 +58,10 @@
``makeelement`` factory function to ``parse()`` and ``fromstring()``.
By default, this is based on the HTML parser defined in ``lxml.html``.
+
+Entity handling
+===============
+
By default, the BeautifulSoup parser also replaces the entities it
finds by their character equivalent.
@@ -94,3 +98,45 @@
>>> tostring(body, method="html", encoding=unicode)
u'<body>\xa9\u20ac-\xf5\u01bd<p></p></body>'
+
+
+Using soupparser as a fallback
+==============================
+
+The downside of using this parser is that it is `much slower`_ than
+the HTML parser of lxml. So if performance matters, you might want to
+consider using ``soupparser`` only as a fallback for certain cases.
+
+.. _`much slower`: http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
+
+One common problem with lxml's parser is that it may not get the
+encoding right when the document contains a ``<meta>`` tag in the
+wrong place. In this case, you can exploit the fact that lxml
+serialises much faster than most other HTML libraries for Python.
+Just serialise the document to unicode and, if that raises an
+exception, re-parse it with BeautifulSoup to see if that works
+better.
+
+.. sourcecode:: pycon
+
+ >>> tag_soup = '''\
+ ... <meta http-equiv="Content-Type"
+ ... content="text/html;charset=utf-8" />
+ ... <html>
+ ... <head>
+ ... <title>Hello W\xc3\xb6rld!</title>
+ ... </head>
+ ... <body>Hi all</body>
+ ... </html>'''
+
+ >>> import lxml.html
+ >>> import lxml.html.soupparser
+
+ >>> root = lxml.html.fromstring(tag_soup)
+ >>> try:
+ ... ignore = tostring(root, encoding=unicode)
+ ... except UnicodeDecodeError:
+ ... root = lxml.html.soupparser.fromstring(tag_soup)
+ ... # try again, but don't catch the exception this time
+ ... ignore = tostring(root, encoding=unicode)
+
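Note on the fallback pattern added above: it can be wrapped in a small helper, sketched below. This is an illustration only and is not part of this revision. The name ``parse_with_fallback`` and its three callable parameters are hypothetical; the control flow mirrors the doctest in the patch (parse with the fast parser, serialise as a check, re-parse with the slow parser on ``UnicodeDecodeError``).

```python
def parse_with_fallback(data, fast_parse, slow_parse, serialise):
    """Parse `data` with `fast_parse`; fall back to `slow_parse` if
    serialising the result raises UnicodeDecodeError.

    All three callables are supplied by the caller (hypothetical
    parameters, not an lxml API).
    """
    root = fast_parse(data)
    try:
        # Serialising is used only as a cheap sanity check on the encoding.
        serialise(root)
    except UnicodeDecodeError:
        root = slow_parse(data)
        # Try again, but let a second failure propagate to the caller.
        serialise(root)
    return root
```

With the names used in the patch, the callables would presumably be ``lxml.html.fromstring``, ``lxml.html.soupparser.fromstring`` and a wrapper around ``tostring(root, encoding=unicode)``. Passing them in keeps the slow BeautifulSoup-based parser off the common path.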