[lxml-dev] Target parser parsing error

qhlonline qhlonline at 163.com
Wed Jun 3 14:35:39 CEST 2009


Hi, all
   Here is more information about the parsing error I get when I use a target parser to parse http://www.jiayuan.com/ . The fatal error reported is: "Input is not proper UTF-8, indicate encoding !". To find where the problem really occurs, I tried converting the encoding of the HTML string with iconv directly. It also reports an error, and the index of the offending character in the string is exactly the same as in my lxml test. So it is now clear that this parsing error is caused by the iconv conversion from UTF-8 to UTF-8 when the source contains illegal characters. When I do not define the data function in my target parser, it parses without reporting any error. Does this mean that when I leave out the data function, the UTF-8 to UTF-8 conversion is also skipped? Or is some correct conversion done before the data function is called?
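   For reference, the offending position can also be found without iconv. The following is only a minimal sketch using the Python standard library, not what lxml or libxml2 does internally. The helper name first_invalid_utf8_offset and the sample bytes are made up for illustration and are not taken from the real page.

```python
def first_invalid_utf8_offset(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")  # strict decoding raises on the first bad sequence
    except UnicodeDecodeError as exc:
        return exc.start      # byte index where the invalid sequence begins
    return None


# Hypothetical input: well-formed markup with one stray 0xFF byte in the middle.
sample = b"<p>ok</p>" + b"\xff" + b"<p>more</p>"

offset = first_invalid_utf8_offset(sample)
if offset is not None:
    # Show a few bytes around the problem to help locate it in the source.
    context = sample[max(0, offset - 8):offset + 8]
    print("invalid UTF-8 at byte", offset, "context:", context)
```

   Running the same helper on the downloaded page bytes should give an offset that can be compared with the position iconv reports, assuming both count in bytes from the start of the input.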
                                                  yours
     
