[lxml-dev] lxml and html encodings
Ian Bicking
ianb at colorstudy.com
Wed Oct 11 00:16:51 CEST 2006
Chris Abraham wrote:
> Hello,
> We are getting some unexpected behavior when processing documents with a
> Shift_JIS encoding.
> We are trying to serialize an HTML document using an XSLT transform.
> Our results don't agree with the FAQ:
> http://codespeak.net/lxml/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings.
> Please see the comments in the attached demo.py which reads in home.html
> and demonstrates our problem.
Does etree.HTML() pay any attention to <meta http-equiv="Content-Type"
content="text/html; charset=Shift_JIS">? I notice the output contains that
tag (generated by the XSL, I assume), but the parser doesn't necessarily
apply the same logic when reading the input.
I think for HTML it is better if the encoding is determined before
parsing, as there are several types of information that come into play. I
think the FAQ entry doesn't really apply here, since this isn't really
XML. This library probably has the best rules for determining encoding:
http://chardet.feedparser.org/
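
To illustrate "determine the encoding before parsing", here is a minimal
sketch using only the Python standard library. It is not what lxml or
chardet do internally. The function names (sniff_html_encoding,
decode_html), the order of the checks, the 4096-byte scan window, and the
windows-1252 fallback are all assumptions made for this example.
Statistical detection of the kind chardet performs would slot in just
before the fallback and is not shown.

```python
import codecs
import re

# Looks for charset=... inside a <meta> tag. Hypothetical helper pattern:
# it is a rough heuristic, not a full HTML tokenizer.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)

# Byte order marks and the Python codec that consumes each of them.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def _known(name):
    """Return Python's canonical codec name, or None if it is unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def sniff_html_encoding(data, http_charset=None, default="windows-1252"):
    """Guess the encoding of raw HTML bytes before any parsing happens.

    Assumed priority: transport-level charset, then a byte order mark,
    then a <meta> declaration near the top, then a fixed default.
    """
    # 1. Charset from the HTTP Content-Type header, if the caller has one.
    if http_charset and _known(http_charset):
        return _known(http_charset)
    # 2. Byte order mark at the very start of the document.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 3. <meta ... charset=...> within the first few kilobytes.
    match = _META_CHARSET.search(data[:4096])
    if match:
        name = _known(match.group(1).decode("ascii"))
        if name:
            return name
    # 4. Nothing declared: fall back to the assumed default.
    return _known(default)


def decode_html(data, http_charset=None):
    """Decode raw HTML bytes to text using the sniffed encoding."""
    encoding = sniff_html_encoding(data, http_charset)
    return data.decode(encoding, errors="replace")
```

The decoded text, or the detected encoding name together with the raw
bytes, can then be handed to the HTML parser. Current lxml releases accept
the name as etree.HTMLParser(encoding=...); whether the lxml version
discussed in this thread does is not something this sketch assumes.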
--
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org