[lxml-dev] lxml and html encodings

Stefan Behnel behnel_ml at gkec.informatik.tu-darmstadt.de
Thu Oct 12 18:48:54 CEST 2006


Hi,

Chris Abraham wrote:
> We are getting some unexpected behavior when processing documents with a
> Shift_JIS encoding. 
> We are trying to serialize an HTML document using an XSLT transform. 
> Our results don't agree with the FAQ:
> http://codespeak.net/lxml/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings. 
> Please see the comments in the attached demo.py which reads in home.html
> and demonstrates our problem.

I looked into it and found that the behaviour of the libxml2 parser depends on
the position of the <meta> tag that declares the encoding. Your HTML is pretty
broken in many regards. However, when you move the <meta> tag so that it sits
inside <head> and before any text (especially before the <title> tag), the
encoding is treated correctly.
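
For anyone who cannot edit the HTML by hand, the same reordering could be
done as a preprocessing step on the raw bytes. The following is only a
minimal sketch using the Python standard library: the function name
hoist_meta_charset, the two regular expressions, and the sample document are
assumptions made up for illustration, not part of lxml or libxml2 and not the
contents of the attached file.

```python
import re

# Hypothetical helper patterns (not part of lxml or libxml2).
# They match on raw bytes, before any decoding has happened.
META_CHARSET = re.compile(rb'<meta\b[^>]*charset\s*=[^>]*>', re.IGNORECASE)
HEAD_OPEN = re.compile(rb'<head\b[^>]*>', re.IGNORECASE)


def hoist_meta_charset(html: bytes) -> bytes:
    """Move the first charset-declaring <meta> tag to directly after <head>.

    Returns the input unchanged if there is no such <meta> tag or no <head>.
    """
    meta = META_CHARSET.search(html)
    if meta is None:
        return html
    # Cut the tag out of its current position ...
    without_meta = html[:meta.start()] + html[meta.end():]
    head = HEAD_OPEN.search(without_meta)
    if head is None:
        return html
    # ... and re-insert it as the first thing inside <head>.
    return (without_meta[:head.end()]
            + meta.group(0)
            + without_meta[head.end():])


# Made-up sample: the <title> contains Shift_JIS text and comes *before*
# the <meta> tag that declares the encoding.
title = "\u30c6\u30b9\u30c8".encode("shift_jis")
broken = (
    b"<html><head><title>" + title + b"</title>"
    b'<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">'
    b"</head><body></body></html>"
)

fixed = hoist_meta_charset(broken)
```

The resulting bytes can then be handed to the HTML parser as before. Whether
this is enough for a given document depends on the libxml2 version and on
what else is wrong with the markup.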

I attached a modified HTML file that parses nicely and serialises into UTF-8.

So, the right place to ask this question is on the libxml2 mailing list, not
on the lxml mailing list.

Stefan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20061012/cc9f832d/attachment.html 

