[lxml-dev] Encoding problems with lxml

Roger Patterson rogerpatterson at gmail.com
Thu Jun 28 21:42:04 CEST 2007


Hi Bruno,
I had similar problems, except my HTML was even more broken, so I ended
up parsing the page with elementtree.TidyHTMLTreeBuilder first and then
converting the resulting string to an lxml tree with etree.XML().
That fixed only the broken HTML, not the encoding problem.
For encoding detection, check out BeautifulSoup, which has very kindly
made its encoding detection usable on its own (
http://www.crummy.com/software/BeautifulSoup/documentation.html#Beautiful%20Soup%20Gives%20You%20Unicode,%20Dammit
)

And Leonard basically got this encoding detection from here: 
http://chardet.feedparser.org
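
If you would rather not add a dependency, here is a rough stdlib-only
sketch of the same idea (Python 3). It is not what BeautifulSoup or chardet
do internally; it only assumes the pages are either UTF-8 or ISO-8859-1,
as Bruno expects. The name decode_guess is made up for this example.

```python
def decode_guess(raw, candidates=("utf-8",), fallback="iso-8859-1"):
    """Decode raw bytes, returning (text, encoding_used).

    Assumption: anything that is not valid UTF-8 is ISO-8859-1.
    """
    for enc in candidates:
        try:
            # Strict decoding raises on byte sequences invalid in enc.
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # ISO-8859-1 maps every byte to a character, so this never raises.
    return raw.decode(fallback), fallback


text, used = decode_guess("caf\u00e9".encode("iso-8859-1"))
```

UTF-8 is tried first because accented ISO-8859-1 text is almost never
valid UTF-8, while the reverse check can never fail.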

Good luck!
-Roger

Bruno Barberi Gnecco wrote:
> Stefan Behnel wrote:
>
>     Thanks a lot for the prompt answer, Stefan.
>
>   
>>> I'm having some encoding problems with lxml that I can't solve. My application
>>> is a small web mining spider. Pages downloaded can be in any encoding, but I'm
>>> expecting mostly utf8 and iso-8859-1. I need to get the parsed data in
>>> iso-8859-1.
>   
>>
>> Note that this may already fail in the decoding step of the parser. If the
>> HTML is so *broken* that libxml2 can't even detect a <meta> encoding tag, it
>> will not know what encoding to use.
>>     
>
>     I see, this is what is happening. I even found a page that declared
> itself as iso but was actually utf-8. The parser I was using before (PHP's
> DOM) seemed to cope with that somehow, so I never even noticed it.
>
>   
>>> I'm having two problems:
>>>
>>> a) when reading pages in iso-8859-1, accented characters are converted to HTML
>>> sequences, such as &agrave; for ` + a. I don't want this to happen; how can I avoid it?
>>>       
>>
>> You can serialise through an XSLT. The lxml.html module in lxml 2.0 will do
>> that for you, but you can easily implement that yourself.
>>
>> Look for "Serialization" in
>> http://codespeak.net/svn/lxml/branch/html/src/lxml/html/__init__.py
>>     
>
> ....
>
>   
>> I only noticed now that this was referring to parsing. Any reason you don't
>> want entities resolved here?
>>
>> lxml 2.0 will allow you to keep entities in the tree, although they are rarely
>> of any help.
>>     
>
>     I don't want the characters converted to entities because I'm adding
> the extracted data to a searchable database (which already exists). If I use
> sequences such as & ccedil; or & #233;, searching will fail.
>
>     You might suggest converting the search string to sequences, but then
> I would lose useful features such as case insensitivity and unaccented
> words matching accented ones.
>
>     I'd prefer to have a standard encoding on the tree, because my
> xpath query might contain accented characters. I don't care how it is
> encoded in the tree, as long as I can query it.
>
>     And, when I extract the data with etree.tostring(entry, 'iso-8859-1')
> (entry being the result of an XPath query or a find(tag)), I'd like to get
> iso characters instead of sequences, whatever the original encoding. I
> haven't been able to do this yet; any tips?
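
For text that already contains such sequences, one possible workaround is
to turn the references back into characters and only then encode. This is
a sketch, assuming Python 3 and that you are handling extracted text (for
example an element's text), not serialized markup: html.unescape would also
turn &lt; back into <. The helper name to_latin1 is hypothetical.

```python
from html import unescape


def to_latin1(text):
    """Resolve entity and character references, then encode to ISO-8859-1."""
    resolved = unescape(text)  # "&ccedil;" and "&#233;" become real characters
    # Characters with no ISO-8859-1 equivalent are replaced by b"?".
    return resolved.encode("iso-8859-1", errors="replace")


stored = to_latin1("fa&ccedil;ade &#233;t&eacute;")
```

The database then receives plain ISO-8859-1 bytes, so case-insensitive and
accent-insensitive matching keep working as before.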
>
>   
>>> b) I can't convert pages originally in UTF to ISO, even using
>>> etree.tostring(entry, 'iso-8859-1') or string.encode("iso-8859-1").
>>>       
>>
>> Both should work in general (the first being better anyway) - except when you
>> have a <meta> tag in there that says "utf-8" encoding. Then you can't expect
>> the browser to ignore that. lxml will not magically delete it either, you have
>> to do that by hand.
>>     
>
>
>     But shouldn't tostring() convert to iso, even if it was in utf-8?
>
>   
>>> Have I missed something in the docs? I want homogeneous behavior for
>>> all encodings, even if it means converting first to UTF and later to ISO.
>>>       
>>
>> You don't have to, at least not for working on the tree. lxml will properly
>> decode strings to Python (unicode) strings at the API level - *iff* the parser
>> managed to detect the encoding of the HTML page. If not, you will get garbage.
>> But then that's really the fault of the page.
>>
>> If you have any other way to detect the encoding of a broken page (e.g. all
>> pages from a specific source are undeclared UTF-8 or something), you can also
>> pre-treat the input *before* parsing it, i.e. recode it properly and remove
>> the <meta> tag with a regular expression. Then the parser should no longer
>> have any problems.
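
The pre-treatment Stefan describes could look roughly like this. It is a
stdlib-only sketch for Python 3, assuming you already know the real source
encoding by some other means; pretreat and META_CHARSET_RE are names
invented here, and the regular expression is deliberately crude.

```python
import re

# Any <meta ...> tag that mentions "charset", in either the
# http-equiv/content form or the short <meta charset="..."> form.
META_CHARSET_RE = re.compile(rb"<meta[^>]+charset[^>]*>", re.IGNORECASE)


def pretreat(raw, source_encoding, target_encoding="utf-8"):
    """Drop the charset <meta> tag and recode the page bytes.

    Assumes source_encoding is ASCII-compatible, so the tag can be
    matched on the raw bytes before decoding.
    """
    without_meta = META_CHARSET_RE.sub(b"", raw)
    # Undecodable bytes become U+FFFD instead of raising.
    text = without_meta.decode(source_encoding, errors="replace")
    return text.encode(target_encoding)


page = b'<head><meta charset="iso-8859-1"></head><body>caf\xc3\xa9</body>'
clean = pretreat(page, "utf-8")
```

With the declaration gone, the parser still has to be told the encoding in
some way; later lxml releases accept an encoding keyword on
etree.HTMLParser for that.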
>>     
>
>
>     Hm, I see. Every day I am reminded of that adage from Tanenbaum,
> "The good thing about standards is that there are so many to choose from."
>
>     I suppose the most robust solution then is to try to find the encoding of
> the page myself, and make sure that <meta> is correct, and possibly check
> et.docinfo.encoding to see if lxml got it right. Is that it?
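
One way to check that the <meta> declaration is believable before parsing,
sketched with the Python 3 standard library only. The heuristic is an
assumption of this example, not something lxml does: non-ASCII bytes that
form valid UTF-8 are taken to be UTF-8. Both function names are made up.

```python
import re

# Captures the charset name from either form of the <meta> tag.
DECLARED_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE
)


def declared_charset(raw):
    """Return the charset named in a <meta> tag, lowercased, or None."""
    match = DECLARED_RE.search(raw)
    return match.group(1).decode("ascii").lower() if match else None


def declaration_suspect(raw):
    """True if the declaration is missing or contradicts the bytes."""
    declared = declared_charset(raw)
    if declared is None:
        return True  # nothing declared at all
    if all(byte < 0x80 for byte in raw):
        return False  # pure ASCII fits any ASCII-compatible charset
    try:
        raw.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    # Suspect when "is valid UTF-8" and "claims UTF-8" disagree.
    return valid_utf8 != (declared in ("utf-8", "utf8"))
```

A page flagged here is a candidate for the recoding step; after parsing,
docinfo.encoding shows what the parser itself settled on.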
>
>     Thanks again,
>
> Bruno