<div>Hi,<br></div><pre>2009-06-03, "Stefan Behnel" &lt;stefan_ml@behnel.de&gt; wrote:<br>>Hi,<br>><br>>qhlonline wrote:<br>>> There are cases where an HTML file has meta tags, but the<br>>> charset declared in the tag is wrong, because the HTML content that follows<br>>> uses a different encoding. But lxml will parse according to what the tag says.<br>>> In this situation, it may report a decoding error,<br>>> but sometimes it can parse, and generates a DOM that is incomplete.<br>><br>>By default, the HTML parser will ignore errors and try to keep parsing<br>>regardless. Pass "recover=False" if you want to get an exception instead.<br>><br>>Note that character decoding errors cannot always be detected, as they may<br>>lead to valid (although unreadable) characters even when the wrong encoding<br>>is assumed. Latin-1 is a good example, which uses a plain 8-bit encoding.<br>>It will work perfectly well to read a UTF-8 encoded document with a Latin-1<br>>decoder. It just won't give you readable output in most cases.<br>><br>><br>>> eg. I have a web file has while the following content is encoded with<br>>> GBK (which is a superset of GB2312). We got a result with only<br>>> part of the HTML tags parsed out. I want to know whether lxml reports any<br>>> warning or error information for this situation. What is it?<br>><br>>See the error_log property on the parser.<br>><br>>http://codespeak.net/lxml/parsing.html#error-log<br>><br>I have tried to get error information through parser.error_log. Most of the log messages look like "<font color="#800000">Element script embeds close tag</font>", and I know these errors must have been recovered from, because the HTML parser uses "recover=True" by default. 
But there are still some useful messages:<br> The wrong charset in &lt;meta&gt; causes incomplete parsing of <font color="#800080">http://www.sina.com/</font>; when that happens, the corresponding log entry is: "<font color="#800000">input conversion failed due to input error, bytes 0xAD 0x5A 0xB6 0xF9</font>".<br> In another mail I described the failure when parsing <font color="#800080">http://www.jiayuan.com/</font> with a target parser. That site's page is broken: it is encoded in GB18030 but declares utf-8 in &lt;meta&gt;. If my target parser defines a <font color="#800080">data</font> method, it reports: <font color="#800000"> UnicodeDecodeError: 'utf8' codec can't decode byte 0xbe in position 10: unexpected code byte</font>. The target parser then aborts and no error_log is available. If I do not define the <font color="#800080">data</font> method in my target parser, parsing produces a good result, but the error_log contains: "Input is not proper UTF-8, indicate encoding !". Maybe the lxml parser recovered from this problem by itself, which is why I still get a result. Only after I changed the HTML content of <font color="#800080">http://www.jiayuan.com/</font>, replacing its &lt;meta content=charset utf-8&gt; with &lt;meta content=charset GB18030&gt; (GB18030 being the correct encoding of this HTML file), was the page parsed smoothly, with no encoding-related entries left in the error_log.<br><br>>> Is there any common<br>>> method? I have also seen some HTML files that have "lang" attributes on tags. I<br>>> don't know whether this attribute is used in the HTML parsing process.<br>><br>>I don't think so.<br>><br>><br>>> In meta tag like , there is also a language statement, but in the<br>>> htmlCheckMeta method of libxml2 library source, I didn't find any<br>>> processing of the http-equiv attribute value "Content-Language".<br>><br>>The "language" is not relevant to the parser. The charset is. 
Just think of<br>>UTF-8, which can encode any written language that uses characters defined<br>>in Unicode.<br>><br>>Stefan<br>><br></pre>
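<div>To illustrate the point about undetectable decoding errors, here is a minimal sketch that checks whether the declared charset can actually decode the page bytes before they reach the parser. The helper name <code>pick_encoding</code>, the sample text and the fallback list are my own assumptions for illustration; nothing here is part of lxml.</div>
<pre>
```python
# Sketch only: pick_encoding, the sample text and the fallback list
# are illustrative assumptions, not lxml API.

def pick_encoding(raw, declared, fallbacks=("gb18030",)):
    """Return the first encoding name that decodes raw without error,
    trying the declared charset first, or None if none of them works."""
    for name in (declared,) + tuple(fallbacks):
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None

# Page content that is really GB18030, as in the jiayuan.com case.
text = "中文内容"
raw = text.encode("gb18030")

# A wrong "utf-8" declaration is detectable: decoding fails, so the
# helper falls through to the next candidate.
chosen = pick_encoding(raw, "utf-8")

# A wrong "latin-1" declaration is NOT detectable: every byte maps to
# some character, so decoding succeeds but the text is unreadable.
undetected = pick_encoding(raw, "latin-1")
garbled = raw.decode("latin-1")

print(chosen, undetected, garbled != text)
```
</pre>
<div>The chosen name could then be passed to the parser explicitly, for example <code>lxml.etree.HTMLParser(encoding=chosen)</code>, instead of editing the &lt;meta&gt; tag by hand. Note that a successful decode only proves the bytes are valid in that encoding, not that the encoding is the right one.</div>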