[lxml-dev] Weird errors in tostring

Stefan Behnel stefan_ml at behnel.de
Sun Apr 13 09:04:19 CEST 2008


Hi,

Bruno wrote:
> In the other machine all goes well. FYI, the tree (root variable) is being 
> built with root = lxml.html.fromstring(data). I'm parsing data in utf8 and
> iso-8859-1, and this particular backtrace happened in an HTML document
> correctly labelled with a meta charset=iso-8859-1. 

You can ask the document which encoding it was parsed with:

    >>> print(root.getroottree().docinfo.encoding)

It should say "iso-8859-1" if the parser picked up the <meta> tag correctly.
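
The reported name may differ in case or spelling from what you expect (e.g.
"ISO-8859-1" vs. "latin-1"), so comparing through Python's codec registry is
a bit more robust than comparing the strings directly. A minimal sketch: the
helper names same_encoding() and parsed_as() are made up for illustration, and
`root` is assumed to be the element you got back from lxml.html.fromstring().

```python
import codecs

def same_encoding(a, b):
    """Return True if two encoding names resolve to the same Python codec."""
    try:
        return codecs.lookup(a).name == codecs.lookup(b).name
    except LookupError:
        # At least one of the names is unknown to Python's codec registry.
        return False

def parsed_as(root, expected):
    """Compare the encoding reported for root's tree with the expected one.

    Only relies on root.getroottree().docinfo.encoding, as shown above.
    """
    reported = root.getroottree().docinfo.encoding
    return reported is not None and same_encoding(reported, expected)
```

With that, parsed_as(root, "iso-8859-1") tells you whether the parser ended up
with the encoding you expected for that document.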

Also, maybe the <meta> tag comes after the <title> in the document? AFAIR,
libxml2's HTML parser switches encodings when it sees a <meta> declaration,
but it doesn't reparse the document (as most browsers do to work around this
problem), so anything before the <meta> tag has already been decoded with the
encoding the parser started out with.
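
To see whether a given document is affected, you could check the raw bytes
before handing them to the parser. This is only a rough sketch using regular
expressions from the standard library, not how libxml2 itself looks for the
declaration, and the function name meta_after_title() is made up for
illustration:

```python
import re

# A <meta ...> tag that carries a charset declaration somewhere inside it.
META_CHARSET = re.compile(br'<meta[^>]+charset\s*=', re.I)
# The opening <title> tag.
TITLE = re.compile(br'<title\b', re.I)

def meta_after_title(data):
    """Return True if a charset <meta> tag exists in the raw bytes but
    first appears after the opening <title> tag."""
    meta = META_CHARSET.search(data)
    title = TITLE.search(data)
    if meta is None or title is None:
        return False
    return meta.start() > title.start()
```

If this returns True for the failing document, the <title> text is a likely
candidate for having been decoded with the wrong encoding.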

If the parser gets the encoding wrong, you can try parsing with BeautifulSoup
(separate install) by using the fromstring() function in lxml.html.ElementSoup
instead. That's quite a bit slower, but it *might* give you better results in
this case.

http://codespeak.net/lxml/elementsoup.html

(Note that the soupparser module was added in 2.0.3 to fix the parse()
function. In 2.0.2, just use the ElementSoup module.)

Stefan

