[lxml-dev] HTMLParser encoding

Max Ivanov ivanov.maxim at gmail.com
Sat Aug 23 14:52:33 CEST 2008


>> If there is no meta tag defining the document encoding, how does HTMLParser
>> convert text data into Unicode? Does it contain some encoding
>> detection machinery?
>
> Yes, but that's implemented in libxml2 and I don't know much about the
> details. There are some ways to help it, though, in case it gets it wrong. If
> you can provide the proper encoding (e.g. as provided through HTTP, MIME or
> some other source), you can pass it to the parser when you create it. Or, you
> can decode the data to a unicode string and pass that to the parser.
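
A minimal sketch of the two options mentioned in the quoted reply, assuming the `encoding` keyword of lxml's `etree.HTMLParser`. The sample document and the choice of ISO-8859-1 are made up for illustration; in practice the encoding would come from HTTP, MIME or some other source.

```python
from lxml import etree

# Hypothetical sample: ISO-8859-1 bytes with no meta tag declaring the charset.
raw = "<html><body><p>caf\u00e9</p></body></html>".encode("iso-8859-1")

# Option 1: pass the externally known encoding to the parser at creation time.
parser = etree.HTMLParser(encoding="iso-8859-1")
root = etree.fromstring(raw, parser)
text_from_bytes = root.findtext(".//p")

# Option 2: decode the data first and hand the parser a unicode string.
decoded = raw.decode("iso-8859-1")
root2 = etree.fromstring(decoded, etree.HTMLParser())
text_from_unicode = root2.findtext(".//p")

print(text_from_bytes == text_from_unicode)
```

Either way the parser no longer has to guess the encoding, so both trees should end up with the same Unicode text.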

I plan to use the chardet module (http://chardet.feedparser.org/) to
detect the charset if no meta tag is present. chardet needs untouched text
for proper detection: I can't pass it the Unicode text from
element.text or .text_content(), and I can't pass it the plain text full of
tags either, since that makes chardet return wrong results. Is there any way to
restore the original text from element.text or text_content()?

