[Lxml-checkins] r53746 - lxml/branch/lxml-2.0/doc
scoder at codespeak.net
Sun Apr 13 20:28:23 CEST 2008
Author: scoder
Date: Sun Apr 13 20:28:20 2008
New Revision: 53746
Modified:
lxml/branch/lxml-2.0/doc/elementsoup.txt
Log:
partial doc merge from trunk
Modified: lxml/branch/lxml-2.0/doc/elementsoup.txt
==============================================================================
--- lxml/branch/lxml-2.0/doc/elementsoup.txt (original)
+++ lxml/branch/lxml-2.0/doc/elementsoup.txt Sun Apr 13 20:28:20 2008
@@ -3,22 +3,28 @@
====================
BeautifulSoup_ is a Python package that parses broken HTML. While libxml2
-(and thus lxml) can also parse broken HTML, BeautifulSoup is much more
+(and thus lxml) can also parse broken HTML, BeautifulSoup is a bit more
forgiving and has superior `support for encoding detection`_.
.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
.. _`support for encoding detection`: http://www.crummy.com/software/BeautifulSoup/documentation.html#Beautiful%20Soup%20Gives%20You%20Unicode,%20Dammit
+.. _ElementSoup: http://effbot.org/zone/element-soup.htm
lxml can benefit from the parsing capabilities of BeautifulSoup
through the ``lxml.html.soupparser`` module. It provides three main
functions: ``fromstring()`` and ``parse()`` to parse a string or file
-using BeautifulSoup, and `convert_tree()` to convert an existing
+using BeautifulSoup, and ``convert_tree()`` to convert an existing
BeautifulSoup tree into a list of top-level Elements.
The functions ``fromstring()`` and ``parse()`` behave as known from
ElementTree. The first returns a root Element, the latter returns an
ElementTree.
+There is also a legacy module called ``lxml.html.ElementSoup``, which
+mimics the interface provided by ElementTree's own ElementSoup_
+module. Note that the ``soupparser`` module was added in lxml 2.0.3.
+Previous versions of lxml 2.0.x only have the ``ElementSoup`` module.
+
Here is a document full of tag soup, similar to, but not quite like, HTML::
>>> tag_soup = '<meta><head><title>Hello</head<body onload=crash()>Hi all<p>'
@@ -47,6 +53,10 @@
``makeelement`` factory function to ``parse()`` and ``fromstring()``.
By default, this is based on the HTML parser defined in ``lxml.html``.
+
+Entity handling
+===============
+
By default, the BeautifulSoup parser also replaces the entities it
finds by their character equivalent::
@@ -83,4 +93,41 @@
mimics the interface provided by ElementTree's own ElementSoup_
module.
-.. _ElementSoup: http://effbot.org/zone/element-soup.htm
+
+Using soupparser as a fallback
+==============================
+
+The downside of using this parser is that it is `much slower`_ than
+the HTML parser of lxml. So if performance matters, you might want to
+consider using ``soupparser`` only as a fallback for certain cases.
+
+.. _`much slower`: http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
+
+One common problem of lxml's parser is that it might not get the
+encoding right in cases where the document contains a ``<meta>`` tag
+at the wrong place. In this case, you can exploit the fact that lxml
+serialises much faster than most other HTML libraries for Python.
+Just serialise the document to unicode and if that gives you an
+exception, re-parse it with BeautifulSoup to see if that works
+better::
+
+ >>> tag_soup = '''\
+ ... <meta http-equiv="Content-Type"
+ ... content="text/html;charset=utf-8" />
+ ... <html>
+ ... <head>
+ ... <title>Hello W\xc3\xb6rld!</title>
+ ... </head>
+ ... <body>Hi all</body>
+ ... </html>'''
+
+ >>> import lxml.html
+ >>> import lxml.html.soupparser
+
+ >>> root = lxml.html.fromstring(tag_soup)
+ >>> try:
+ ...     ignore = tostring(root, encoding=unicode)
+ ... except UnicodeDecodeError:
+ ...     root = lxml.html.soupparser.fromstring(tag_soup)
+ ...     # try again, but don't catch the exception this time
+ ...     ignore = tostring(root, encoding=unicode)
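
The fallback pattern from the doctest above can also be wrapped in a small helper. The following is only a sketch, not part of the committed document: the names ``parse_with_fallback`` and ``parse_html`` are hypothetical, and it assumes a Python 3 setup in which ``lxml.etree.tostring(root, encoding="unicode")`` still signals a wrongly detected encoding by raising ``UnicodeDecodeError``.

```python
def parse_with_fallback(data, primary, fallback, check):
    """Parse ``data`` with ``primary``; re-parse with ``fallback`` if ``check`` fails.

    ``check`` is any callable that raises UnicodeDecodeError when the
    parsed tree cannot be serialised to text.
    """
    root = primary(data)
    try:
        check(root)
    except UnicodeDecodeError:
        root = fallback(data)
        # Try again, but do not catch the exception this time.
        check(root)
    return root


def parse_html(data):
    """Hypothetical wiring of the helper to lxml and its soupparser module."""
    # Imported locally so that the generic helper above does not need lxml.
    import lxml.html
    import lxml.html.soupparser
    from lxml.etree import tostring

    return parse_with_fallback(
        data,
        primary=lxml.html.fromstring,
        fallback=lxml.html.soupparser.fromstring,
        check=lambda root: tostring(root, encoding="unicode"),
    )
```

Keeping the parsers as parameters means the slow BeautifulSoup path is only taken when the fast parser's result fails the check, which matches the intent of using ``soupparser`` as a fallback only.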