[lxml-dev] HTMLParser encoding
Stefan Behnel
stefan_ml at behnel.de
Sat Aug 23 14:02:05 CEST 2008
Hi,
Max Ivanov wrote:
> If there is no meta tag defining the document encoding, how does
> HTMLParser convert text data into Unicode? Does it contain some encoding
> detection machinery?
Yes, but the detection is implemented in libxml2 and I don't know much about
the details. There are ways to help it, though, in case it guesses wrong. If
you know the proper encoding (e.g. from HTTP headers, MIME or some other
source), you can pass it to the parser when you create it, using the
"encoding" keyword argument of HTMLParser. Alternatively, you can decode the
data to a unicode string yourself and pass that to the parser.
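
A minimal sketch of both options, in case it helps. The helper names, the
sample document and the Content-Type value are made up for illustration; only
etree.HTMLParser(encoding=...) and etree.HTML() are lxml API. The charset is
read from an HTTP-style Content-Type header using the standard library, and
lxml is imported lazily inside the functions that need it.

```python
from email.message import Message


def charset_from_content_type(content_type, default=None):
    """Return the charset parameter of a Content-Type value, or default."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


def parse_with_declared_encoding(raw_bytes, encoding):
    """Option 1: tell the parser the encoding when creating it."""
    from lxml import etree  # third-party dependency, imported lazily

    parser = etree.HTMLParser(encoding=encoding)
    return etree.HTML(raw_bytes, parser)


def parse_decoded(raw_bytes, encoding):
    """Option 2: decode to a unicode string first, then parse that."""
    from lxml import etree  # third-party dependency, imported lazily

    return etree.HTML(raw_bytes.decode(encoding))


# Hypothetical input: a Latin-1 page without a meta tag, plus the header
# value that a server might have sent along with it.
RAW = "<html><body><p>Gr\u00fc\u00dfe</p></body></html>".encode("iso-8859-1")
ENCODING = charset_from_content_type("text/html; charset=ISO-8859-1")
```

With lxml installed, parse_with_declared_encoding(RAW, ENCODING) and
parse_decoded(RAW, ENCODING) should give equivalent trees. If the header
carries no charset, the helper returns the default and you are back to
relying on libxml2's detection.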
Stefan