[lxml-dev] lxml and html encodings
Luke Tucker
ltucker at openplans.org
Wed Oct 11 16:10:21 CEST 2006
> Does etree.HTML() pay any attention to <meta http-equiv="Content-Type"
> content="text/html; charset=Shift_JIS"> ?
[...]
> I think for HTML it is better if the encoding is determined before
> parsing, as there are several types of information that come into play. I
> think the FAQ entry doesn't really apply here, since it isn't really
> XML.
[...]
I'm not certain. The FAQ entry says that HTML unicode strings containing
charset meta tags also do not work, and I took that to cover parsing
via etree.HTML(). We can certainly extract the encoding and decode to a
unicode string before calling the parser, but it seemed worth getting
some clarification on the intended behavior as well.
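
For reference, the "extract the encoding and decode first" approach could
look roughly like the sketch below. It uses only the standard library; the
names sniff_charset and decode_html are made up for illustration and are not
part of lxml. It assumes the meta tag appears in the first few kilobytes of
the document, that the markup up to that point is ASCII-compatible, and that
falling back to UTF-8 is acceptable when no usable charset is found.

```python
import codecs
import re

# Hypothetical helpers for illustration; not part of lxml.
# Matches both <meta http-equiv=... content="...; charset=X"> and <meta charset="X">.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def sniff_charset(raw, default="utf-8"):
    """Return the charset named in a meta tag near the start of raw, else default."""
    match = _META_CHARSET.search(raw[:4096])
    if match is None:
        return default
    name = match.group(1).decode("ascii")
    try:
        codecs.lookup(name)  # reject names Python has no codec for
    except LookupError:
        return default
    return name


def decode_html(raw, default="utf-8"):
    """Decode raw HTML bytes to a unicode string using the sniffed charset."""
    return raw.decode(sniff_charset(raw, default), errors="replace")
```

The decoded string could then be handed to etree.HTML(). Whether a charset
meta tag left in that string interferes with parsing is exactly the open
question above, so this sketch does not settle it.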
- Luke