[lxml-dev] lxml and html encodings

Ian Bicking ianb at colorstudy.com
Wed Oct 11 00:16:51 CEST 2006


Chris Abraham wrote:
> Hello,
> We are getting some unexpected behavior when processing documents with a
> Shift_JIS encoding. 
> We are trying to serialize an HTML document using an XSLT transform. 
> Our results don't agree with the FAQ:
> http://codespeak.net/lxml/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings. 
> Please see the comments in the attached demo.py which reads in home.html
> and demonstrates our problem.

Does etree.HTML() pay any attention to <meta http-equiv="Content-Type" 
content="text/html; charset=Shift_JIS">?  I notice that tag shows up in 
your output (generated by the XSL, I assume), but the parser doesn't 
necessarily apply the same logic when reading the input.

I think for HTML it is better to determine the encoding before 
parsing, as there are several sources of information that come into 
play (the HTTP Content-Type header, the <meta> tag, and the bytes 
themselves).  I don't think the FAQ entry really applies here, since 
HTML isn't really XML.  This library probably has the best rules for 
determining encoding: 
http://chardet.feedparser.org/
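
To make "determine the encoding before parsing" concrete, here is a rough 
sketch using only the standard library.  The function name 
guess_html_encoding, its parameters, the 4096-byte scan window and the 
order of the checks are all my own assumptions for illustration, not 
anything lxml or chardet does; a statistical detector such as chardet 
would slot in as one more fallback step.

```python
import codecs
import re

# Finds charset=... inside a <meta> tag (http-equiv form or <meta charset=...>).
_META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE
)

_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def _known(name):
    """Return Python's canonical codec name, or None if the label is unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def guess_html_encoding(raw, http_charset=None, default="utf-8"):
    """Hypothetical helper: pick an encoding for raw HTML bytes before parsing."""
    # 1. A charset from the transport (e.g. an HTTP header), if the caller has one.
    if http_charset:
        found = _known(http_charset)
        if found:
            return found
    # 2. A byte order mark at the very start of the data.
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return _known(name)
    # 3. A <meta> declaration near the top of the document.
    match = _META_CHARSET.search(raw[:4096])
    if match:
        found = _known(match.group(1).decode("ascii"))
        if found:
            return found
    # 4. Nothing declared: a detector such as chardet could be tried here
    #    before falling back to the default.
    return _known(default)


raw = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=Shift_JIS"></head>'
    "<body>\u65e5\u672c\u8a9e</body></html>"
).encode("shift_jis")

encoding = guess_html_encoding(raw)
text = raw.decode(encoding)  # unicode text, decoded before any parser sees it
```

Current lxml releases also let you hand the result to the parser 
directly via etree.HTMLParser(encoding=...), which avoids decoding the 
document yourself.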

-- 
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org

