[lxml-dev] Encoding problems with lxml

Stefan Behnel stefan_ml at behnel.de
Thu Jun 28 09:05:18 CEST 2007



Bruno Barberi Gnecco wrote:
> I'm having some encoding problems with lxml that I can't solve. My application
> is a small web mining spider. Pages downloaded can be in any encoding, but I'm
> expecting mostly utf8 and iso-8859-1. I need to get the parsed data in 
> iso-8859-1.

Note that this may already fail in the decoding step of the parser. If the
HTML is so *broken* that libxml2 can't even detect a <meta> encoding tag, it
will not know what encoding to use.


> I'm having two problems:
> 
> a) when reading pages in iso-8859-1, accented characters are converted to HTML
> sequences, such as &agrave; for ` + a. I don't want this to happen, how to avoid it?

You can serialise through an XSLT. The lxml.html module in lxml 2.0 will do
that for you, but you can easily implement that yourself.

Look for "Serialization" in
http://codespeak.net/svn/lxml/branch/html/src/lxml/html/__init__.py
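
A minimal sketch of the idea, not the code from the file above: a stylesheet
that copies the document and leaves the HTML serialisation, including the
target encoding, to <xsl:output>. The stylesheet text and the helper name
serialize_html_latin1 are made up for illustration, and bytes(result) assumes
a reasonably recent lxml on Python 3.

```python
from lxml import etree

# Hypothetical stylesheet: copy the whole document and let the XSLT output
# step serialise it as HTML in the requested encoding.
_XSLT_SOURCE = b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" encoding="iso-8859-1"/>
  <xsl:template match="/">
    <xsl:copy-of select="."/>
  </xsl:template>
</xsl:stylesheet>
"""
_to_latin1_html = etree.XSLT(etree.XML(_XSLT_SOURCE))


def serialize_html_latin1(tree):
    """Return the tree serialised as HTML bytes via the stylesheet above."""
    result = _to_latin1_html(tree)
    # bytes() on an XSLT result honours the <xsl:output> settings.
    return bytes(result)
```

Something like serialize_html_latin1(etree.HTML(page_data)) should then give
you a byte string you can write out as-is.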


> b) I can't convert pages originally in UTF to ISO, even using
> etree.tostring(entry, 'iso-8859-1') or string.encode("iso-8859-1").

Both should work in general (the first being better anyway) - except when you
have a <meta> tag in there that says "utf-8" encoding. Then you can't expect
the browser to ignore that. lxml will not magically delete it either, you have
to do that by hand.

IIRC, the XSLT serialisation step should also add a matching <meta> tag for
you.
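
Doing the removal by hand could look roughly like the following. This is only
a sketch: drop_charset_meta and to_latin1 are hypothetical helper names, and
the method="html" argument to tostring() assumes a newer lxml than the one
discussed in this thread.

```python
from lxml import etree


def drop_charset_meta(root):
    """Remove <meta> elements that declare an encoding (hypothetical helper)."""
    for meta in root.xpath("//meta"):
        http_equiv = (meta.get("http-equiv") or "").lower()
        if http_equiv == "content-type" or meta.get("charset") is not None:
            # Note: remove() also drops any tail text after the element,
            # which is normally just whitespace inside <head>.
            meta.getparent().remove(meta)


def to_latin1(root):
    """Strip stale encoding declarations, then serialise as iso-8859-1."""
    drop_charset_meta(root)
    return etree.tostring(root, encoding="iso-8859-1", method="html")
```

Characters that do not exist in iso-8859-1 cannot be written as plain bytes,
so expect those to come out as character references.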


> Have I missed something in the docs? I want to have a homogeneous behavior for
> all encodings--even if it means to convert first to UTF and later to ISO. 

You don't have to, at least not for working on the tree. lxml will properly
decode strings to Python (unicode) strings at the API level - *iff* the parser
managed to detect the encoding of the HTML page. If not, you will get garbage.
But then that's really the fault of the page.
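
To illustrate the good case with a made-up page (the byte string below is just
an example, written for Python 3): the input is iso-8859-1 and says so in its
<meta> tag, so the text you get back from the tree is a normal unicode string.

```python
from lxml import etree

# Example input: iso-8859-1 bytes with a matching <meta> declaration.
page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=iso-8859-1"></head>'
        b'<body><p>voil\xe0</p></body></html>')

root = etree.HTML(page)

# The API hands back decoded text, independent of the input encoding.
paragraph_text = root.findtext(".//p")
```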

If you have any other way to detect the encoding of a broken page (e.g. all
pages from a specific source are undeclared UTF-8 or something), you can also
pre-treat the input *before* parsing it, i.e. recode it properly and remove
the <meta> tag with a regular expression. Then the parser should no longer
have any problems.
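
A rough sketch of such a pre-treatment step, assuming you already know the
real encoding from somewhere else. The regular expression and the function
name parse_with_known_encoding are my own invention and only catch the common
spellings of the tag.

```python
import re

from lxml import etree

# Matches any <meta ...> tag that mentions "charset" (simplified on purpose).
_CHARSET_META = re.compile(r"<meta\b[^>]*\bcharset\b[^>]*>", re.IGNORECASE)


def parse_with_known_encoding(raw_bytes, encoding):
    """Decode with the externally known encoding, drop <meta>, then parse."""
    text = raw_bytes.decode(encoding, "replace")
    text = _CHARSET_META.sub("", text)
    # Parsing a unicode string leaves no encoding for the parser to guess.
    return etree.HTML(text)
```

Newer lxml versions also let you pass an encoding to the parser itself, as in
etree.HTMLParser(encoding="utf-8"), which may save you the decoding step.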

Stefan


