[lxml-dev] lxml decoding problem caused by <meta> tag specification

Stefan Behnel stefan_ml at behnel.de
Tue Jun 2 20:29:57 CEST 2009


Hi,

qhlonline wrote:
> There are cases where an HTML file has <meta> tags, but the
> charset declared in the <meta> tag is wrong, because the HTML content that
> follows uses a different encoding. lxml will still parse according to what
> the <meta> tag says. In this situation it may report a decoding error,
> but sometimes it parses anyway and generates an incomplete DOM.

By default, the HTML parser will ignore errors and try to keep parsing
regardless. Pass "recover=False" if you want to get an exception instead.
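
A minimal sketch of the strict mode, assuming lxml is installed. The helper
name parse_html_strict and the sample documents are made up for illustration,
and which inputs libxml2 actually rejects depends on its version:

```python
from lxml import etree


def parse_html_strict(data):
    """Parse HTML bytes without recovery.

    Returns (root, None) on success, or (None, message) if the parser
    raised instead of silently repairing the document.
    """
    parser = etree.HTMLParser(recover=False)
    try:
        return etree.fromstring(data, parser), None
    except etree.XMLSyntaxError as exc:
        return None, str(exc)


# A small, well-formed document parses normally.
good = b"<html><head><title>t</title></head><body><p>ok</p></body></html>"
root, error = parse_html_strict(good)

# A stray end tag is the kind of input that may be rejected in strict mode.
broken = b"<html></body>"
broken_root, broken_error = parse_html_strict(broken)
print("strict parse of broken input failed:", broken_error is not None)
```

With the default parser (recover=True) the same broken input would simply be
repaired as well as libxml2 can manage.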

Note that character decoding errors cannot always be detected, as they may
lead to valid (although unreadable) characters even when the wrong encoding
is assumed. Latin-1 is a good example: it is a plain 8-bit encoding in which
every byte value maps to a character. Reading a UTF-8 encoded document with a
Latin-1 decoder will therefore never fail. It just won't give you readable
output in most cases.
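
The effect is easy to reproduce with Python's own codecs, independent of
lxml. This is only an illustration of the point above; the sample string and
the function name are arbitrary:

```python
def decode_utf8_as_latin1(text):
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    utf8_bytes = text.encode("utf-8")
    # Latin-1 maps every byte value to a character, so this never raises.
    return utf8_bytes, utf8_bytes.decode("latin-1")


original = "Grüße"
utf8_bytes, garbled = decode_utf8_as_latin1(original)

# No exception, but one character per byte instead of the original text.
print(len(original), len(utf8_bytes), len(garbled))

# The opposite mistake is detectable: Latin-1 bytes are often invalid UTF-8.
try:
    original.encode("latin-1").decode("utf-8")
    reverse_failed = False
except UnicodeDecodeError:
    reverse_failed = True
```

So a wrong 8-bit charset can go unnoticed by the decoder, while a wrong
multi-byte charset usually (not always) produces a decoding error.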


> E.g. I have a web file with a <meta> charset declaration, while the content
> that follows is encoded in GBK (which is a superset of GB2312). We got a
> result with only part of the HTML tags parsed out. I want to know whether
> lxml reports any warning or error information in this situation. What is it?

See the error_log property on the parser.

http://codespeak.net/lxml/parsing.html#error-log
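
A rough sketch of reading that log after a recovering parse, assuming lxml is
installed. The helper name and the sample markup are invented for the example,
and whether any entries are reported for a given input depends on the libxml2
version:

```python
from lxml import etree


def parse_and_collect_errors(data):
    """Parse HTML with the default (recovering) parser and return the
    root element together with a plain list of the logged problems."""
    parser = etree.HTMLParser()  # recover=True is the default
    root = etree.fromstring(data, parser)
    entries = [
        (e.line, e.column, e.level_name, e.type_name, e.message)
        for e in parser.error_log
    ]
    return root, entries


sample = b"<html><body><p>one</b><p>two</body></html>"
root, entries = parse_and_collect_errors(sample)
for line, column, level, type_name, message in entries:
    print("%s:%s %s %s: %s" % (line, column, level, type_name, message))
```

An empty log does not prove that the charset was right, for the reason given
above: a wrong decoder can still produce valid characters.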


> Is there any common
> method? I have also seen some HTML files whose tags carry a "lang"
> attribute. I don't know whether this attribute is used in the HTML parsing
> process.

I don't think so.


> In some meta tags there is also a language statement. But in the
> htmlCheckMeta method of the libxml2 library source, I didn't find any
> processing of the http-equiv attribute value "Content-Language".

The "language" is not relevant to the parser. The charset is. Just think of
UTF-8, which can encode any written language that uses characters defined
in Unicode.
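
If you already know the real encoding, one option is to tell the parser
directly instead of relying on the <meta> tag. The sketch below is not part of
the answer above; it assumes lxml is installed, that the HTMLParser "encoding"
keyword overrides the declared charset, and that the underlying iconv knows
"gbk". The sample page and the helper name are hypothetical:

```python
from lxml import etree

# Hypothetical page: the <meta> tag claims GB2312, but the bytes are GBK.
# U+9555 can be encoded in GBK (to my knowledge it is not part of GB2312).
page = (
    "<html><head>"
    '<meta http-equiv="Content-Type" content="text/html; charset=gb2312">'
    "</head><body><p>\u9555</p></body></html>"
).encode("gbk")


def parse_with_known_encoding(data, encoding):
    """Parse HTML bytes, overriding whatever charset the document declares."""
    parser = etree.HTMLParser(encoding=encoding)
    return etree.fromstring(data, parser)


root = parse_with_known_encoding(page, "gbk")
paragraph = root.findtext(".//p")
print(ascii(paragraph))
```

The language of the text plays no part here; only the byte-to-character
mapping given by the charset does.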

Stefan


