[lxml-dev] lxml and html encodings

Ian Bicking ianb at colorstudy.com
Wed Oct 11 00:22:29 CEST 2006


Ian Bicking wrote:
> I think for HTML it is better if the encoding is determined before 
> parsing, as there are several types of information that come into play.  I 
> think the FAQ entry doesn't really apply here, since it isn't really 
> XML.  This library probably has the best rules for determining encoding: 
> http://chardet.feedparser.org/

Actually, now that I look at this library, it's probably more clever than 
necessary.  Generally there should be good encoding information already 
present in the request, and you don't need heuristics like this to 
figure it out.  Nevertheless, you should probably determine the encoding 
early, before parsing.  To find the encoding specified in the <meta> 
tag, you should probably just use a regular expression, since you can't 
very well parse the document to learn its encoding when you need that 
encoding to decode it for the parser in the first place.
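
A rough sketch of that regular-expression approach, in Python, might look 
like the following.  The function names, the 4096-byte scan window, and 
the latin-1 fallback are all assumptions made for illustration; the 
pattern is a heuristic and not a complete treatment of every way a 
<meta> tag can be written.

```python
import re

# Heuristic pattern (an assumption, not a full HTML parser): find a
# charset=... value inside a <meta ...> tag, quoted or unquoted.
_META_CHARSET_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9_.:\-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(raw, scan_bytes=4096):
    """Return the charset named in a <meta> tag near the start of raw, or None.

    Hypothetical helper: only the first scan_bytes bytes are searched.
    """
    match = _META_CHARSET_RE.search(raw[:scan_bytes])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


def decode_html(raw, header_charset=None, fallback="latin-1"):
    """Decode raw HTML bytes before handing the text to a parser.

    Tries, in order: the charset supplied with the request (if any),
    the charset sniffed from <meta>, and finally the fallback.
    """
    for candidate in (header_charset, sniff_meta_charset(raw), fallback):
        if candidate is None:
            continue
        try:
            return raw.decode(candidate)
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name or bytes that do not fit it: try the next one.
            continue
    # Last resort: never raise, replace undecodable bytes instead.
    return raw.decode(fallback, errors="replace")
```

Here the charset that came with the request wins over the <meta> tag, 
and the regular expression is only consulted when that information is 
missing or turns out to be wrong for the bytes at hand.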

-- 
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org

