[lxml-dev] Weird errors in tostring
Stefan Behnel
stefan_ml at behnel.de
Tue Apr 15 09:38:49 CEST 2008
Hi,
Bruno Barberi Gnecco wrote:
> Hi Stephan,
>>> In the other machine all goes well. FYI, the tree (root variable) is
>>> being built with root = lxml.html.fromstring(data). I'm parsing data
>>> in utf8 and iso-8859-1, and this particular backtrace happened in a HTML
>>> document correctly labelled with a meta charset=iso-8859-1.
>>
>>
>> You can ask the document which encoding it was parsed with:
>>
>> >>> print root.getroottree().docinfo.encoding
>>
>> It should say "iso-8859-1" if the parser picked up the <meta> tag
>> correctly.
>
> It says 'None', actually.
Then that's a clear sign that libxml2 didn't pick up the encoding.
> Shouldn't it give the error when *parsing* and creating the tree,
> instead of when converting the tree to something else?
HTML is parsed with the "recover" option, which lets libxml2 try to work
around all sorts of broken page content *without* raising an error. You can
still check the error log of the parser to see what happened on the way through
the page.
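
As a rough sketch of what checking the log could look like (written for
Python 3; the sample markup is made up for illustration, and which entries
show up depends on the libxml2 version in use):

```python
from lxml import etree

# Made-up sample input: an unclosed <b> and a tag HTML does not know.
data = b"<html><body><p>unclosed <b>bold</p><blink2>x</blink2></body></html>"

# recover=True is already the default for the HTML parser; it is spelled out
# here only to make the behaviour visible.
parser = etree.HTMLParser(recover=True)
root = etree.HTML(data, parser)

# Parsing did not raise, but the parser may have logged what it worked around.
for entry in parser.error_log:
    print(entry.line, entry.column, entry.level_name, entry.type_name, entry.message)
```

An empty log does not prove the page was fine; it only means libxml2 did not
report anything while recovering.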
> I thought lxml stored the parsed tree in unicode.
UTF-8, actually, which is much easier (and faster) to handle in C than any
other unicode encoding.
> Besides, I'm asking for a unicode string:
>
> tostring(root, method='xml', encoding=unicode)
Which lets lxml serialise the tree to a Python unicode character sequence in
XML style. I know, this looks simple, but there's actually work being done here.
>> Also, maybe the <meta> tag comes behind the <title> in the document? AFAIR,
>> libxml2's HTML parser switches encodings when it sees a <meta> declaration,
>> but it doesn't reparse the document (as most browsers do to work around
>> this problem).
>
> It happens with fragments of HTML as well (I'm actually reading HTML
> messages). Yet I was having this problem with pages downloaded from the
> internet, in which the encoding was incorrectly detected.
Which implies most of the time that it was incorrectly specified as well. That
is a very common problem in real world HTML pages. Browsers do a great deal of
work in their Quirks mode to figure out the page encoding.
libxml2's HTML parser works pretty well, but fortune telling wasn't one of its
design goals.
> Since I had more information
> in that case (HTTP headers, with a chardet pass just to be sure) I ended up
> forcing the encoding with a 'html.decode(encoding)' step before building
> the tree. I think it's weird that it works (since some pages declare one
> encoding and use a different one), but it does.
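
The decode-first approach described above might look roughly like this in
Python 3. The helper name decode_html, its parameters and the iso-8859-1
fallback are assumptions for illustration, not part of lxml; the chardet pass
mentioned above is left out.

```python
import re

def decode_html(raw, content_type=None, fallback="iso-8859-1"):
    """Decode raw HTML bytes with the charset named in an HTTP Content-Type header.

    Hypothetical helper: falls back to `fallback` when the header has no
    charset, names an unknown codec, or does not match the actual bytes.
    """
    charset = None
    if content_type:
        match = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)", content_type, re.IGNORECASE)
        if match:
            charset = match.group(1)
    try:
        return raw.decode(charset or fallback)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name, or bytes that do not fit the declared charset.
        return raw.decode(fallback, errors="replace")

text = decode_html(b"<p>caf\xe9</p>", "text/html; charset=ISO-8859-1")
print(text)
```

The decoded string can then be handed to lxml.html.fromstring() in place of
the raw bytes.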
You might want to strip <meta> Content Type tags from the string using a
regex; that should make sure it works in all cases. Read the function
"htmlCheckEncoding()" in libxml2's HTMLparser.c to see what works and what
doesn't. For example, there is some code to prevent changing the parser
encoding a second time, so that you can override it with the "encoding" parser
keyword in lxml.
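
A minimal sketch of such a regex, assuming the input has already been decoded
to a string. The names META_CONTENT_TYPE and strip_meta_content_type are made
up for this example, and the pattern only covers http-equiv style
declarations, not every way a page can announce its encoding.

```python
import re

# Matches <meta ... http-equiv="Content-Type" ...> in any attribute order,
# with or without quotes around the attribute value.
META_CONTENT_TYPE = re.compile(
    r"""<meta\b[^>]*\bhttp-equiv\s*=\s*["']?content-type["']?[^>]*>""",
    re.IGNORECASE,
)

def strip_meta_content_type(html_text):
    """Remove <meta http-equiv="Content-Type"> tags from an HTML string."""
    return META_CONTENT_TYPE.sub("", html_text)

page = (
    '<head><meta http-equiv="Content-Type" '
    'content="text/html; charset=iso-8859-1"><title>x</title></head>'
)
print(strip_meta_content_type(page))
```

Other <meta> tags (description, keywords and so on) are left alone, since the
pattern requires the http-equiv attribute.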
>> If the parser gets the encoding wrong, you can try parsing with
>> BeautifulSoup (separate install) by using the fromstring() function in
>> lxml.html.ElementSoup instead. That's quite a bit slower, but it *might*
>> give you better results in this case.
I wrote a little doc section on that topic:
http://codespeak.net/lxml/elementsoup.html#using-soupparser-as-a-fallback
> First, why does it work in one of the machines and not in the other,
> even with the same data? I installed Python2.5, but with the same results.
> Maybe the cause is libxml2 (2.6.30 where it works, 2.6.26 where it
> doesn't)?
That's almost definitely the reason, yes.
> Second, if the tree is created, how to know if the encoding is
> wrong? I only convert to string much later.
You can serialise immediately, just for testing; that will tell you. Or, you
can check the parser error log for encoding errors.
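
A sketch of the "serialise immediately" test, written for Python 3, where
encoding="unicode" plays the role of the encoding=unicode call quoted above.
The helper name check_tree and the sample page are made up, and since the
thread does not name the exception that tostring raised, the broad except
clause is an assumption.

```python
from lxml import etree
import lxml.html

def check_tree(data):
    """Parse `data`, then serialise right away so problems show up early.

    Hypothetical helper: returns (root, None) on success, or (root, exception)
    if serialisation failed.
    """
    root = lxml.html.fromstring(data)
    try:
        etree.tostring(root, method="xml", encoding="unicode")
    except Exception as exc:  # exact exception type not given in the thread
        return root, exc
    return root, None

sample = (
    b"<html><head><meta http-equiv='Content-Type' "
    b"content='text/html; charset=iso-8859-1'><title>t</title></head>"
    b"<body><p>caf\xe9</p></body></html>"
)
root, problem = check_tree(sample)
print(problem)
```

Running this right after parsing costs one extra serialisation per document,
so it is probably best kept for testing rather than production code.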
Stefan