[lxml-dev] lxml decoding problem caused by <meta> tag specification
qhlonline
qhlonline at 163.com
Tue Jun 2 04:25:56 CEST 2009
Hi, all
There are cases where an HTML file has a meta tag whose declared charset is wrong, because the HTML content that follows uses a different encoding. lxml, however, parses according to what the tag says. In this situation it may report a decoding error, but sometimes it parses anyway and generates a DOM that is not complete. For example, I have a web file whose meta tag declares one charset while the following content is encoded with GBK (which is a superset of GB2312). We got a result with only part of the HTML tags parsed out. I want to know whether lxml reports any warning or error information for this situation. What is it? How can we deal with this kind of fault? Is there a common method?
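One possible way to handle the mismatch described above is to check the declared charset before parsing and override it when the bytes do not decode under it. The following is only a sketch under assumptions: the helper names (declared_charset, pick_encoding, parse_html) and the fallback list are hypothetical, and a strict trial decode is a heuristic, not a reliable detector. The lxml calls used are etree.HTMLParser(encoding=...), etree.HTML() and the parser's error_log.

```python
import re

# Looks for charset=... inside a <meta> tag near the start of the file.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def declared_charset(data):
    """Return the charset named in a <meta> tag, or None if none is found."""
    match = META_CHARSET.search(data[:4096])
    return match.group(1).decode("ascii").lower() if match else None


def pick_encoding(data, fallbacks=("gbk", "gb18030", "utf-8")):
    """Return the first encoding under which the whole file decodes strictly.

    The declared charset is tried first; the fallbacks are an assumption
    suited to Chinese pages and should be adapted to the data at hand.
    """
    declared = declared_charset(data)
    candidates = ([declared] if declared else []) + list(fallbacks)
    for name in candidates:
        try:
            data.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding, try the next one
        return name
    return None


def parse_html(data):
    """Parse with the chosen encoding and return (root, error_log)."""
    from lxml import etree  # third-party, imported only when parsing
    parser = etree.HTMLParser(encoding=pick_encoding(data))
    root = etree.HTML(data, parser)
    return root, parser.error_log
```

Calling parse_html(raw_bytes) on such a file should then parse it as GBK even though the meta tag says otherwise, and the returned error_log can be printed to see what the parser complained about.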
I have also seen HTML files whose tags carry a "lang" attribute; I don't know whether this attribute is used in the HTML parsing process. A meta tag with http-equiv="Content-Language" also carries a language statement, but in the htmlCheckMeta function of the libxml2 library source I didn't find any processing of the http-equiv attribute value "Content-Language". Is that because "Content-Language" is not standard? Does lxml support this attribute? If so, how does it deal with a content="zh-cn" declaration when the text is actually in a different language?
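Whatever the parser itself does with them, the language declarations mentioned above can be collected separately and compared with the real content. A minimal sketch using only the Python standard library follows; the class and function names (LanguageHints, language_hints) are hypothetical and this says nothing about how lxml or libxml2 treat these attributes.

```python
from html.parser import HTMLParser


class LanguageHints(HTMLParser):
    """Collect "lang" attributes and the Content-Language meta value."""

    def __init__(self):
        super().__init__()
        self.lang_attrs = []          # (tag, lang value) in document order
        self.content_language = None  # content of the Content-Language meta

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, not values.
        attrs = dict(attrs)
        if attrs.get("lang"):
            self.lang_attrs.append((tag, attrs["lang"]))
        http_equiv = (attrs.get("http-equiv") or "").lower()
        if tag == "meta" and http_equiv == "content-language":
            self.content_language = attrs.get("content")


def language_hints(text):
    """Return (content_language, lang_attrs) for an HTML string."""
    collector = LanguageHints()
    collector.feed(text)
    collector.close()
    return collector.content_language, collector.lang_attrs
```

The declared values are only hints: deciding that a page marked "zh-cn" is really in another language still needs a separate check of the text itself.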
yours