[lxml-dev] lxml and html encodings
Ian Bicking
ianb at colorstudy.com
Wed Oct 11 00:22:29 CEST 2006
Ian Bicking wrote:
> I think for HTML it is better if the encoding is determined before
> parsing, as there are several types of information that come into play. I
> think the FAQ entry doesn't really apply here, since it isn't really
> XML. This library probably has the best rules for determining encoding:
> http://chardet.feedparser.org/
Actually, now that I look at this library, it's probably more clever than
necessary. Generally there should be good encoding information already
present in the request, and you don't need heuristics like this to
figure it out. Nevertheless, you should probably determine the encoding
early, before parsing. To find the encoding specified in the <meta> tag,
you should probably just use a regular expression, since you can't very
well parse the document to find out how to decode it before you pass it
to the parser.
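
As a rough illustration of the regular-expression approach, here is a
minimal sketch using only the Python standard library. The names
sniff_meta_charset, decode_html and header_charset are hypothetical
(they are not part of lxml), and it assumes that a charset supplied
with the request takes precedence over the <meta> tag, with UTF-8 as
an arbitrary last-resort default:

```python
import codecs
import re
from typing import Optional

# Hypothetical pattern: finds "charset=..." inside a <meta ...> tag. It covers
# both <meta charset="utf-8"> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> form.
META_CHARSET_RE = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([a-zA-Z0-9_.:-]+)""",
    re.IGNORECASE,
)


def _known_codec(name: Optional[str]) -> Optional[str]:
    """Return Python's canonical codec name, or None if the name is unknown."""
    if not name:
        return None
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def sniff_meta_charset(raw: bytes, limit: int = 4096) -> Optional[str]:
    """Look for a charset declaration in the first `limit` bytes of raw HTML."""
    match = META_CHARSET_RE.search(raw[:limit])
    if match is None:
        return None
    # The character class above only admits ASCII, so this decode cannot fail.
    return _known_codec(match.group(1).decode("ascii"))


def decode_html(raw: bytes, header_charset: Optional[str] = None,
                default: str = "utf-8") -> str:
    """Decode raw HTML bytes before they are handed to a parser.

    Assumed precedence: charset from the request, then <meta>, then `default`.
    """
    encoding = _known_codec(header_charset) or sniff_meta_charset(raw) or default
    # errors="replace" keeps going on bad bytes instead of raising.
    return raw.decode(encoding, errors="replace")
```

The resulting unicode string can then be passed to whatever parser you
use. This deliberately does no statistical guessing of the kind chardet
does; it only reads what the document or the request declares.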
--
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org