[lxml-dev] Target parser parsing error

Stefan Behnel stefan_ml at behnel.de
Wed Jun 3 15:34:24 CEST 2009


qhlonline wrote:
>    Here is more information about the parsing error I get when I use a
> target parser to parse http://www.jiayuan.com/ . The fatal error reported
> is: "Input is not proper UTF-8, indicate encoding !" To find the exact
> place where this problem occurred, I tried to convert the encoding of the
> HTML string with iconv directly. This also reported an error,
> and the index of the offending character in the string is the same as in
> my lxml test. So it is now clear that this parsing error is caused by
> the encoding conversion of iconv from utf-8 to utf-8 when there are illegal
> characters in the source.

What do you mean by "from utf-8 to utf-8" conversion?


> When I do not define the data function in my
> target parser, it parses without reporting an error. Does this mean that
> when I leave out the data function, the UTF-8 to UTF-8 conversion is
> also skipped? Or has some correct conversion been done before the call
> to the data function?

It just means that the parser has ignored your character content. There
are two levels here. The libxml2 parser will parse the byte stream and try
to convert it to UTF-8. If that fails but it is asked to "recover" from
it, it will just continue without raising an error. Not sure what becomes
of the data in this case, but apparently there is no guarantee that the
invalid bytes that were parsed up to this point get stripped.

The second level is where lxml comes into play. When you define a
"data()" method on your target parser, you ask lxml to pass you the
character data from the document. lxml's SAX handler will then try to
decode the UTF-8 data provided by the libxml2 parser to pass it into your
method. If the data returned by the parser is not valid UTF-8, this will
fail. I assume that this is where the exception that you see originates,
as this is done through the Python Codec API.
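
To illustrate the difference between a target with and without a data()
method, here is a minimal sketch. It uses the standard library's
xml.etree.ElementTree parser rather than lxml, on the assumption that the
target interface (start/end/data/close) behaves the same way for this
purpose; the class names and the sample document are made up for the
example.

```python
from xml.etree.ElementTree import XMLParser


class TagsOnly:
    """Target without data(): character content is never passed to it."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        self.events.append(("start", tag))

    def end(self, tag):
        self.events.append(("end", tag))

    def close(self):
        # The parser's close() returns whatever the target's close() returns.
        return self.events


class TagsAndText(TagsOnly):
    """Target with data(): the parser has to hand over decoded text."""

    def data(self, text):
        self.events.append(("data", text))


def run(target, document):
    parser = XMLParser(target=target)
    parser.feed(document)
    return parser.close()


doc = b"<p>hello</p>"  # hypothetical, well-formed input
without_text = run(TagsOnly(), doc)
with_text = run(TagsAndText(), doc)
```

Only the second target ever sees the text, so only the second one depends
on the character data being decodable.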

Does this clear things up?

That said, I could imagine letting the character decoder work around
broken data if the "recover" option is enabled, simply by replacing broken
content with a replacement character. This would improve the recovery
capabilities in your case, without breaking the data any further than it
already is.
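
As a rough sketch of that idea, using only Python's built-in codec
machinery: strict decoding of invalid UTF-8 raises an exception, while the
"replace" error handler substitutes U+FFFD and keeps going. This is not
lxml's actual code; the sample bytes and the function names are
hypothetical.

```python
# Hypothetical input: ASCII text followed by one GBK-encoded character,
# which is not a valid UTF-8 byte sequence.
raw = b"abc" + "\u4e2d".encode("gbk")


def decode_strict(data):
    """Decode as UTF-8, raising UnicodeDecodeError on invalid bytes."""
    return data.decode("utf-8")


def decode_recovering(data):
    """Decode as UTF-8, substituting U+FFFD for invalid byte sequences."""
    return data.decode("utf-8", errors="replace")


try:
    decode_strict(raw)
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    bad_offset = exc.start  # byte index of the first offending byte

recovered = decode_recovering(raw)
```

The valid prefix survives unchanged in the recovering variant, and the
offset reported by the strict variant points at the first broken byte.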

Stefan


