[lxml-dev] HTMLParser encoding

Ian Bicking ianb at colorstudy.com
Thu Aug 28 19:33:46 CEST 2008


Max Ivanov wrote:
>>> If there is no meta tag defining the document encoding, how does
>>> HTMLParser convert text data into Unicode? Does it contain some
>>> encoding detection machinery?
>> Yes, but that's implemented in libxml2 and I don't know much about the
>> details. There are some ways to help it, though, in case it gets it wrong. If
>> you can provide the proper encoding (e.g. as provided through HTTP, MIME or
>> some other source), you can pass it to the parser when you create it. Or, you
>> can decode the data to a unicode string and pass that to the parser.
> 
> I plan to use the chardet module (http://chardet.feedparser.org/) to
> detect the charset if no meta tag is present. chardet needs untouched text
> for proper detection, so I can't pass it unicode text from
> element.text or .text_content(). I also can't pass it plain text full of
> tags, since that makes chardet return wrong results. Is there any way to
> restore the original text from element.text or text_content()?

You should really run chardet on the raw bytes before parsing the
document, then decode them and parse the resulting unicode document.
There's not much point in running chardet after parsing: by then the
bytes have already been decoded, so it's far too late to do anything useful.
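
A rough sketch of that order of operations, in case it helps. The helper
names (detect_encoding, decode_bytes, parse_html_bytes), the "utf-8"
fallback, and the errors="replace" policy are my own assumptions for
illustration, not anything lxml or chardet prescribes. It relies only on
chardet.detect() returning a dict with an "encoding" key (which may be
None) and on lxml.html.fromstring() accepting a unicode string.

```python
def detect_encoding(raw, detect=None, fallback="utf-8"):
    """Guess the encoding of raw bytes.

    `detect` is any callable returning a dict with an "encoding" key;
    it defaults to chardet.detect. `fallback` is an assumed default used
    when the detector cannot decide.
    """
    if detect is None:
        import chardet  # third-party, imported lazily
        detect = chardet.detect
    guess = detect(raw)
    return guess.get("encoding") or fallback


def decode_bytes(raw, detect=None):
    """Decode raw bytes to unicode using the detected encoding."""
    encoding = detect_encoding(raw, detect)
    # errors="replace" is an assumption: it trades exactness for never
    # failing on a wrong guess.
    return raw.decode(encoding, errors="replace")


def parse_html_bytes(raw, detect=None):
    """Run detection first, then hand unicode text to the parser."""
    import lxml.html  # third-party, imported lazily
    return lxml.html.fromstring(decode_bytes(raw, detect))
```

You would call it on the bytes exactly as they came off the wire or disk,
e.g. parse_html_bytes(open("page.html", "rb").read()), where page.html is
a placeholder file name. If a trustworthy charset is available from HTTP
or MIME headers, prefer that over guessing.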

-- 
Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org

