[lxml-dev] lxml and html encodings

Luke Tucker ltucker at openplans.org
Wed Oct 11 16:10:21 CEST 2006


> Does etree.HTML() pay any attention to <meta http-equiv="Content-Type" 
> content="text/html; charset=Shift_JIS"> ? 
[...]
> I think for HTML it is better if the encoding is determined before 
> parsing, as there's several types of information that come into play.  I 
> think the FAQ entry doesn't really apply here, since it isn't really 
> XML.
[...]

I'm not certain.  The FAQ entry says that HTML Unicode strings
containing charset meta tags also do not work, and I took that to
include parsing via etree.HTML(). We can certainly extract the encoding
and decode to a Unicode string before calling the parser, but it seemed
like we ought to get some clarification on the intended behavior as well.
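
For illustration, the "extract the encoding and decode first" approach
might look roughly like the sketch below. It uses only the Python
standard library. The helper names sniff_meta_charset and decode_html,
the 1024-byte search window, and the UTF-8 fallback are assumptions made
for this example, not anything lxml provides.

```python
import codecs
import re

# Matches a charset declaration inside a <meta ...> tag, e.g.
# <meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(raw, default="utf-8"):
    """Return the charset named in a meta tag, or `default` if none is usable.

    Hypothetical helper: only the first 1024 bytes are searched (an
    assumption), and HTTP headers or byte order marks are not consulted.
    """
    match = _META_CHARSET.search(raw[:1024])
    if match is None:
        return default
    name = match.group(1).decode("ascii")
    try:
        codecs.lookup(name)  # reject names Python has no codec for
    except LookupError:
        return default
    return name


def decode_html(raw, default="utf-8"):
    """Decode raw HTML bytes to a Unicode string using the sniffed charset."""
    return raw.decode(sniff_meta_charset(raw, default), errors="replace")
```

The string returned by decode_html() is what we would then hand to
etree.HTML(). Note that the meta tag is still present in that string,
which is the case the FAQ entry seems to warn about.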

- Luke 




