[lxml-dev] lxml and html encodings
Chris Abraham
cabraham at openplans.org
Tue Oct 17 21:47:10 CEST 2006
Stefan,
Thanks for this. Who should I contact to get the FAQ updated?
http://codespeak.net/lxml/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings
It states that lxml "will not parse" Python unicode strings that carry
encoding info. But here we see that it does.
The same claim also appears in the lxml-specific API documentation:
http://codespeak.net/lxml/api.html
"Similarly, you will get errors when you try the same with HTML data in
a unicode string that specifies a charset in a meta tag of the header.
You should generally avoid converting XML/HTML data to unicode before
passing it into the parsers. It is both slower and error prone."
It's just a minor detail, but I thought it was worth following up on.
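
For illustration, here is a minimal sketch of why the quoted advice favours byte strings. It uses only the Python standard library, not lxml. The sample document, the `declared_charset` helper and its naive regular expression are assumptions made up for this example; they are not how lxml or libxml2 actually detect encodings.

```python
import re

# Hypothetical sample: a Shift_JIS document whose <meta> tag sits inside
# <head> and before <title>.
html_text = (
    "<html><head>"
    '<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">'
    "<title>\u65e5\u672c\u8a9e</title>"
    "</head><body></body></html>"
)
raw = html_text.encode("shift_jis")  # what would normally be read from disk


def declared_charset(data: bytes, default: str = "utf-8") -> str:
    """Return the charset named near the start of the document.

    Illustrative helper only: a naive regex over the first 1024 bytes.
    """
    match = re.search(rb'charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', data[:1024])
    return match.group(1).decode("ascii") if match else default


# While the data is still bytes, the declaration can drive the decoding.
charset = declared_charset(raw)
decoded = raw.decode(charset)

# Decoding up front with a guessed codec raises no error here, but the
# resulting text is wrong, and the meta tag no longer describes it.
guessed = raw.decode("latin-1")
```

Once the data is a unicode string, the charset in the meta tag no longer says anything reliable about it, which is presumably why the documentation calls the conversion error prone.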
Chris
Stefan Behnel wrote:
> Hi,
>
> Chris Abraham wrote:
>
>> We are getting some unexpected behavior when processing documents with a
>> Shift_JIS encoding.
>> We are trying to serialize an HTML document using an XSLT transform.
>> Our results don't agree with the FAQ:
>> http://codespeak.net/lxml/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings.
>> Please see the comments in the attached demo.py which reads in home.html
>> and demonstrates our problem.
>>
>
> I looked into it and found that the behaviour of the libxml2 parser depends on
> the position of the <meta> tag. Your HTML is pretty broken in many regards.
> However, when you move the <meta> tag within <head> and before any text
> (especially before the <title> tag), it is treated correctly.
>
> I attached a modified HTML file that parses nicely and serialises into UTF-8.
>
> So, the right place to ask this question is on the libxml2 mailing list, not
> on the lxml mailing list.
>
> Stefan