[lxml-dev] Encoding problems with lxml
Bruno Barberi Gnecco
brunobg at gmail.com
Thu Jun 28 20:20:04 CEST 2007
Stefan Behnel wrote:
Thanks a lot for the prompt answer, Stefan.
>> I'm having some encoding problems with lxml that I can't solve. My application
>> is a small web mining spider. Pages downloaded can be in any encoding, but I'm
>> expecting mostly utf8 and iso-8859-1. I need to get the parsed data in
>> iso-8859-1.
>
>
>
> Note that this may already fail in the decoding step of the parser. If the
> HTML is so *broken* that libxml2 can't even detect a <meta> encoding tag, it
> will not know what encoding to use.
I see, this is what is happening. I even found a page that declared to
be iso and was utf-8. The parser I was using before (PHP's DOM) seemed to get
over that somehow, so I haven't even noticed it.
>> I'm having two problems:
>>
>> a) when reading pages in iso-8859-1, accented characters are converted to HTML
>> sequences, such as &agrave; for ` + a. I don't want this to happen, how to avoid it?
>
>
>
> You can serialise through an XSLT. The lxml.html module in lxml 2.0 will do
> that for you, but you can easily implement that yourself.
>
> Look for "Serialization" in
> http://codespeak.net/svn/lxml/branch/html/src/lxml/html/__init__.py
....
> I only noticed now that this was referring to parsing. Any reason you don't
> want entities resolved here?
>
> lxml 2.0 will allow you to keep entities in the tree, although they are rarely
> of any help.
I don't want the characters converted to entities because I'm adding
the extracted data to an existing searchable database. If I store
sequences such as &ccedil; or &#233;, searching will fail.
You might suggest converting the search string to the same sequences, but then
I would lose useful features such as case insensitivity and unaccented words
matching accented ones.
I'd prefer to have a standard encoding on the tree, because my
xpath query might contain accented characters. I don't care how it is
encoded in the tree, as long as I can query it.
And when I extract the data with etree.tostring(entry, 'iso-8859-1')
(entry being the result of an XPath query or a find(tag)), I'd like to get
iso-8859-1 characters instead of sequences, whatever the original encoding. I
haven't been able to do this yet. Any tips?
>> b) I can't convert pages originally in UTF to ISO, even using
>> etree.tostring(entry, 'iso-8859-1') or string.encode("iso-8859-1").
>
>
>
> Both should work in general (the first being better anyway) - except when you
> have a <meta> tag in there that says "utf-8" encoding. Then you can't expect
> the browser to ignore that. lxml will not magically delete it either, you have
> to do that by hand.
But shouldn't tostring() convert to iso, even if it was in utf-8?
>> Have I missed something in the docs? I want to have a homogeneous behavior for
>> all encodings--even if it means to convert first to UTF and later to ISO.
>
>
>
> You don't have to, at least, not for working on the tree. lxml will properly
> encode strings to Python (unicode) strings at the API level - *iff* the parser
> managed to detect the encoding of the HTML page. If not, you will get garbage.
> But then that's really the fault of the page.
>
> If you have any other way to detect the encoding of a broken page (e.g. all
> pages from a specific source are undeclared UTF-8 or something), you can also
> pre-treat the input *before* parsing it, i.e. recode it properly and remove
> the <meta> tag with a regular expression. Then the parser should no longer
> have any problems.
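
A minimal sketch of the pre-treatment suggested above, assuming the real
encoding of the page is known by some other means. The function name pretreat
and the regular expression are illustrative, not part of lxml, and the pattern
only covers the common <meta ... charset=...> forms:

```python
import re

# Matches <meta> tags that carry a charset declaration (HTML4 or HTML5 form).
META_CHARSET_RE = re.compile(r'<meta\b[^>]*\bcharset\s*=[^>]*>', re.IGNORECASE)

def pretreat(raw, known_encoding):
    """Decode raw bytes with the encoding known out of band and
    drop any <meta> charset declaration that might contradict it."""
    text = raw.decode(known_encoding, errors='replace')
    return META_CHARSET_RE.sub('', text)

# Hypothetical page that claims iso-8859-1 but is really UTF-8.
raw = (b'<html><head><meta http-equiv="Content-Type" '
       b'content="text/html; charset=iso-8859-1">'
       b'<title>caf\xc3\xa9</title></head></html>')
clean = pretreat(raw, 'utf-8')
```

The result is an already-decoded string with no conflicting declaration left,
which can then be handed to the parser.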
Hm, I see. Every day I remember that adage from Tanenbaum,
"The good thing about standards is that there are so many to choose from."
I suppose the most robust solution then is to try to find the encoding of
the page myself, and make sure that <meta> is correct, and possibly check
et.docinfo.encoding to see if lxml got it right. Is that it?
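
A rough sketch of that check, assuming only the two encodings expected here
(utf-8 and iso-8859-1). The helper names are hypothetical, and the heuristic
is simply "valid UTF-8 means UTF-8", which is an assumption rather than a
guarantee:

```python
import re

# Captures the charset value from a <meta ... charset=...> declaration.
DECLARED_RE = re.compile(
    rb'<meta\b[^>]*\bcharset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)

def declared_encoding(raw):
    """Return the lower-cased charset named in a <meta> tag, or None."""
    match = DECLARED_RE.search(raw)
    return match.group(1).decode('ascii').lower() if match else None

def guess_encoding(raw):
    """Guess utf-8 if the bytes decode cleanly, else fall back to iso-8859-1.
    Pure-ASCII pages are reported as utf-8, which is harmless because
    ASCII bytes mean the same thing in both encodings."""
    try:
        raw.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        return 'iso-8859-1'

def check_page(raw):
    """Return (declared, guessed, agree) for one downloaded page."""
    declared = declared_encoding(raw)
    guessed = guess_encoding(raw)
    return declared, guessed, declared == guessed

# Hypothetical page that declares iso-8859-1 but contains UTF-8 bytes.
page = (b'<meta http-equiv="Content-Type" '
        b'content="text/html; charset=ISO-8859-1"><p>caf\xc3\xa9</p>')
result = check_page(page)
```

When the two disagree, the guessed value could be used to recode the page and
fix the <meta> tag before parsing.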
Thanks again,
Bruno