[lxml-dev] lxml decoding problem caused by <meta> tag specification
qhlonline
qhlonline at 163.com
Wed Jun 3 08:37:57 CEST 2009
Hi,
2009-06-03, "Stefan Behnel" <stefan_ml at behnel.de> wrote:
>Hi,
>
>qhlonline wrote:
>> In some cases an HTML file has a <meta> tag whose declared charset is
>> wrong, because the HTML content that follows uses a different encoding.
>> But lxml parses according to what the tag says. In this situation it may
>> report a decoding error, but sometimes it parses anyway and generates a
>> DOM that is not complete.
>
>By default, the HTML parser will ignore errors and try to keep parsing
>regardless. Pass "recover=False" if you want to get an exception instead.
>
>Note that character decoding errors cannot always be detected, as they may
>lead to valid (although unreadable) characters even when the wrong encoding
>is assumed. Latin-1 is a good example, which uses a plain 8-bit encoding.
>It will work perfectly well to read a UTF-8 encoded document with a Latin-1
>decoder. It just won't give you readable output in most cases.
>
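The Latin-1 point quoted above can be checked without lxml. A minimal sketch in plain Python; the helper name misread_as_latin1 and the sample string are made up for illustration:

```python
def misread_as_latin1(data: bytes) -> str:
    """Decode bytes as Latin-1; this succeeds for any byte string."""
    # Latin-1 maps each of the 256 byte values to one code point,
    # so a strict decode can never raise, whatever the real encoding was.
    return data.decode("latin-1")


original = "中文"                       # sample text, chosen arbitrarily
utf8_bytes = original.encode("utf-8")   # 6 bytes for these 2 characters
garbled = misread_as_latin1(utf8_bytes)

print(len(garbled))          # one character per input byte
print(garbled == original)   # the text is wrong, yet no error was raised
```

So the absence of a decoding error does not show that the declared charset was the right one.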
>
>> E.g. I have a web page that has a <meta> charset declaration, while the
>> content that follows is encoded with GBK (which is a superset of GB2312).
>> We got a result with only part of the HTML tags parsed. I want to know
>> whether lxml reports any warning or error for this situation. What is it?
>
>See the error_log property on the parser.
>
>http://codespeak.net/lxml/parsing.html#error-log
>
I have tried getting error information through parser.error_log. Most of the log messages are like "Element script embeds close tag", and I know these errors must have been recovered from, because the HTML parser has "recover=True" by default. But there is still some useful information:
For the incomplete-parsing problem caused by the <meta> charset when parsing http://www.sina.com/, the corresponding log entry, when the problem occurs, is: "input conversion failed due to input error, bytes 0xAD 0x5A 0xB6 0xF9".
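A rough sketch of how the encoding-related entries could be picked out of the many recovered-error entries in the log. The function names and the ENCODING_HINTS tuple are hypothetical; the two hint strings are fragments of libxml2 messages quoted in this mail, not an exhaustive list. Each log entry is assumed to expose a .message attribute, as lxml's error_log entries do:

```python
# Fragments of the encoding-related messages quoted in this mail (lowercased).
ENCODING_HINTS = ("input conversion failed", "not proper utf-8")


def encoding_related(entries):
    """Return the messages of log entries that look encoding-related."""
    hits = []
    for entry in entries:
        message = entry.message
        if any(hint in message.lower() for hint in ENCODING_HINTS):
            hits.append(message)
    return hits


def parse_and_report(raw_bytes):
    """Parse HTML bytes with lxml and return (root, encoding messages)."""
    from lxml import etree  # third-party; imported lazily

    parser = etree.HTMLParser()  # recover=True is the default
    root = etree.fromstring(raw_bytes, parser)
    return root, encoding_related(parser.error_log)
```

An empty result does not prove that the declared charset was right, since not every wrong decoding produces a log entry.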
In another mail I described the failure when parsing http://www.jiayuan.com/ with a target parser. The page on this site has a problem: it is encoded with GB18030 but declares utf-8 in <meta>. If my target parser defined a data() method, it would report: "UnicodeDecodeError: 'utf8' codec can't decode byte 0xbe in position 10: unexpected code byte". The target parser would then abort, and no error_log was available. If I do not define the data() method in my target parser, the parser produces a good result, but the error_log contains: "Input is not proper UTF-8, indicate encoding !". Maybe the lxml parser recovered from this problem by itself, which is why I can get the result. Only when I changed the HTML content of http://www.jiayuan.com/, replacing its <meta content=charset utf-8> with <meta content=charset GB18030> (GB18030 being the correct encoding of this HTML file), was the page parsed smoothly, with no encoding-related entries left in the error_log.
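Instead of editing the <meta> tag by hand, one possible workaround is to test a few candidate encodings with a strict decode and hand the first one that works to the parser. This is only a sketch under assumptions: the function names and the candidate list are made up, and it relies on the encoding keyword argument of lxml's HTMLParser, which is meant to override the encoding declared in the document:

```python
def first_clean_encoding(raw_bytes, candidates=("utf-8", "gb18030")):
    """Return the first candidate that decodes raw_bytes without error."""
    for name in candidates:
        try:
            raw_bytes.decode(name)  # strict decoding, raises on bad bytes
        except UnicodeDecodeError:
            continue
        return name
    return None  # no candidate matched


def parse_with_encoding(raw_bytes, encoding):
    """Parse HTML bytes with lxml, overriding the declared charset."""
    from lxml import etree  # third-party; imported lazily

    parser = etree.HTMLParser(encoding=encoding)
    return etree.fromstring(raw_bytes, parser)
```

The order of the candidates matters. A clean decode only shows that the bytes are valid in that encoding, and a single-byte encoding such as Latin-1 would accept anything, so it should not be listed ahead of stricter ones.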
>> Is there any common
>> method? I have also seen some HTML files with "lang" attributes on tags. I
>> don't know whether this attribute is used in the HTML parsing process.
>
>I don't think so.
>
>
>> In a meta tag like , there is also a language statement. But in the
>> htmlCheckMeta function of the libxml2 library source, I didn't find any
>> processing of the http-equiv attribute value "Content-Language".
>
>The "language" is not relevant to the parser. The charset is. Just think of
>UTF-8, which can encode any written language that uses characters defined
>in Unicode.
>
>Stefan
>