====================
BeautifulSoup Parser
====================

:Author: Stefan Behnel

BeautifulSoup_ is a Python package that parses broken HTML.  While libxml2
(and thus lxml) can also parse broken HTML, BeautifulSoup is much more
forgiving and has superior `support for encoding detection`_.

.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
.. _`support for encoding detection`: http://www.crummy.com/software/BeautifulSoup/documentation.html#Beautiful%20Soup%20Gives%20You%20Unicode,%20Dammit

lxml can benefit from the parsing capabilities of BeautifulSoup through the
`lxml.html.ElementSoup` module.  It provides two main functions: `parse()` to
parse a file using BeautifulSoup, and `convert_tree()` to convert a
BeautifulSoup tree into a list of top-level Elements.

Here is a document full of tag soup, similar to, but not quite like, HTML::

    >>> tag_soup = '...'

All you need to do is pass it to the `parse()` function::

    >>> from lxml.html.ElementSoup import parse
    >>> from StringIO import StringIO
    >>> root = parse(StringIO(tag_soup))

To see what we have here, you can serialise it::

    >>> from lxml.etree import tostring
    >>> print tostring(root, pretty_print=True)
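
The `convert_tree()` function is described above but not demonstrated. The
following is a minimal sketch rather than a definitive recipe. It assumes that
BeautifulSoup 4 is installed as the ``bs4`` package and that the installed lxml
still ships the `lxml.html.ElementSoup` module. The sample markup, the
``"html.parser"`` argument and the variable names are illustrative choices that
do not come from the text above.

```python
# Assumption: BeautifulSoup 4 (package "bs4") and lxml are both installed.
from bs4 import BeautifulSoup
from lxml.html.ElementSoup import convert_tree

# Illustrative markup with two top-level elements and no <html> root.
markup = "<p>first</p><div>second</div>"

# Build a BeautifulSoup tree using Python's built-in HTML parser.
soup = BeautifulSoup(markup, "html.parser")

# convert_tree() returns a list of lxml Elements, one per top-level node,
# because soup-like input may have more than one root.
elements = convert_tree(soup)

# Collect the tag names of the converted top-level Elements.
tags = [element.tag for element in elements]
```

Because the result is a list and not a single root, fragments without an
enclosing ``<html>`` element can be handed on to lxml one Element at a time.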