[lxml-dev] Encoding problems with lxml

Stefan Behnel stefan_ml at behnel.de
Fri Jun 29 08:23:11 CEST 2007



Bruno Barberi Gnecco wrote:
> Stefan Behnel wrote:
> 
>     Thanks a lot for the prompt answer, Stefan.
> 
>>> I'm having some encoding problems with lxml that I can't solve. My application
>>> is a small web mining spider. Pages downloaded can be in any encoding, but I'm
>>> expecting mostly utf8 and iso-8859-1. I need to get the parsed data in
>>> iso-8859-1.
>>
>>
>> Note that this may already fail in the decoding step of the parser. If the
>> HTML is so *broken* that libxml2 can't even detect a <meta> encoding tag, it
>> will not know what encoding to use.
> 
>     I see, this is what is happening. I even found a page that declared to
> be iso and was utf-8. The parser I was using before (PHP's DOM) seemed to get
> over that somehow, so I haven't even noticed it.

There's not much lxml (or libxml2) can do about this. While libxml2 is pretty
good at parsing broken HTML, it still can't handle plain tag soup, and it
believes a page that explicitly says "I use that encoding".


>>> a) when reading pages in iso-8859-1, accented characters are converted to HTML
>>> sequences, such as &agrave; for ` + a. I don't want this to happen, how to avoid it?
>>
>> I only noticed now that this was referring to parsing. Any reason you don't
>> want entities resolved here?
>>
>> lxml 2.0 will allow you to keep entities in the tree, although they are rarely
>> of any help.
> 
>     I don't want the characters converted to entities because I'm adding
> the extracted data to a searchable database (which already exists). If I use
> sequences such as &ccedil; or &#233;, searching will fail.

The parser will not convert any characters to entities; it will give you
Unicode strings (or plain strings if it's ASCII). Only the serialiser *may*
create entities, depending on the encoding. Look at what you get in the text
content inside the tree (not the serialised document), I'd be surprised if you
found any entity names in there.
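
To illustrate the difference between tree content and serialised output, here
is a minimal sketch. It is written for Python 3 and a recent lxml, and the
sample document and variable names are made up for the example:

```python
from lxml import etree

# Made-up sample document, passed in as a Unicode string.
root = etree.HTML(u"<html><body><p>caf\xe9</p></body></html>")

# The text in the tree is a plain Unicode string, no entities.
text = root.findtext(".//p")

# Serialising to ASCII (the default) cannot represent the accented
# character, so the serialiser writes a character reference instead.
ascii_bytes = etree.tostring(root)

# Serialising to ISO 8859-1 can represent it as a single byte.
latin1_bytes = etree.tostring(root, encoding="iso-8859-1")
```

If `text` looks right but the serialised bytes contain character references,
the target encoding simply cannot represent what is in the tree.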


>     I'd prefer to have a standard encoding on the tree, because my
> xpath query might contain accented characters. I don't care how it is
> encoded in the tree, as long as I can query it.

lxml supports unicode everywhere.


>     And, when I extract the data with etree.tostring(entry, 'iso-8859-1')
> (entry being the result of a xpath query or a find(tag)), I'd like to have
> iso characters instead of sequences, whatever the original encoding. This I
> haven't been able to do yet, any tips?

Hmmm, interesting. Are you sure the document was parsed correctly? Check the
tree to see if you get the correct texts (and not some weird Unicode
characters) in there. Here's what I get:

  >>> import lxml.etree as et
  >>> html = et.HTML("<html><body>üüaösäüöéèàádäüaöü</body></html>")
  >>> et.tostring(html, encoding="iso-8859-1")
  "<?xml version='1.0' encoding='iso-8859-1'?>\n<html><body>\xfc\xfca\xf6s\xe4\xfc\xf6\xe9\xe8\xe0\xe1d\xe4\xfca\xf6\xfc</body></html>"

No entities in there, just plain Latin-1 characters. I really think you hit a
case where the parser detected the wrong encoding, so that the tree couldn't
get serialised to your target encoding afterwards.


>>> b) I can't convert pages originally in UTF to ISO, even using
>>> etree.tostring(entry, 'iso-8859-1') or string.encode("iso-8859-1").
>>
>> Both should work in general (the first being better anyway) - except when you
>> have a <meta> tag in there that says "utf-8" encoding. Then you can't expect
>> the browser to ignore that. lxml will not magically delete it either, you have
>> to do that by hand.
> 
>     But shouldn't tostring() convert to iso, even if it was in utf-8?

That's why you get entities. It's not ISO 8859-1 that's in the tree - at least
not from the point of view of the parser.


>     I suppose the most robust solution then is to try to find the encoding of
> the page myself, and make sure that <meta> is correct, and possibly check
> et.docinfo.encoding to see if lxml got it right. Is that it?

In your case, that's definitely the safest bet, especially if you only have
two possible input encodings.

I'd do this: try decoding it from UTF-8 first (Python's "...".decode()) and if
that fails, fall back to decoding it as ISO 8859-1. UTF-8 is a well defined
multi-byte encoding, so it's relatively easy to distinguish from single-byte
encodings such as ISO-8859-x. Then remove any <meta> encoding tags you find
(use a regexp) and pass it into the HTML() factory *as a Python unicode string*.

  >>> import lxml.etree as et
  >>> input = u"<html><body>üüaösäüöéèàádäüaöü</body></html>" # Unicode !
  >>> html = et.HTML(input)

  >>> et.tostring(html, encoding="iso-8859-1")
  "<?xml version='1.0' encoding='iso-8859-1'?>\n<html><body>\xfc\xfca\xf6s\xe4\xfc\xf6\xe9\xe8\xe0\xe1d\xe4\xfca\xf6\xfc</body></html>"
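
The decoding and <meta> stripping steps that come before the HTML() call
could be sketched roughly as follows. This is only a sketch under a few
assumptions: it is written for Python 3 (where bytes.decode() plays the role
of "...".decode()), the helper names are hypothetical and not part of lxml,
and the regexp is a simple guess that drops any <meta> tag mentioning
"charset", which may be too broad or too narrow for real pages.

```python
import re

# Hypothetical pattern: any <meta ...> tag that mentions "charset".
META_CHARSET_RE = re.compile(r"<meta\b[^>]*\bcharset\b[^>]*>", re.IGNORECASE)


def decode_page(raw):
    """Decode raw page bytes: strict UTF-8 first, ISO 8859-1 as fallback."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO 8859-1 maps every byte to a character, so this cannot fail.
        return raw.decode("iso-8859-1")


def strip_meta_charset(text):
    """Remove <meta> tags that declare an encoding."""
    return META_CHARSET_RE.sub("", text)


def prepare_html(raw):
    """Return a Unicode string that is ready to be passed to HTML()."""
    return strip_meta_charset(decode_page(raw))
```

The result of prepare_html() would then be passed to et.HTML() in place of
the literal Unicode string used above.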

Does that solve your problem?

Stefan


