[lxml-dev] Target parser parsing error
Stefan Behnel
stefan_ml at behnel.de
Thu Jun 4 08:55:11 CEST 2009
Hi,
qhlonline wrote:
> 2009-06-03,"Stefan Behnel" wrote:
>> The libxml2 parser will parse the byte stream and try
>> to convert it to UTF-8. If that fails but it is asked to "recover" from
>> it, it will just continue without raising an error. Not sure what
>> becomes of the data in this case, but apparently there is no guarantee
>> that the invalid bytes that were parsed up to this point get stripped.
>
> I agree with you. I had wondered what libxml2 would do when it hit an
> illegal character. Your answer makes this point clear to me.
Then it's clearer to you than to me. I'm actually not yet convinced that
this is the case. I was only guessing, based on my (limited) knowledge of
the problem you observe, which I have never seen myself in the wild. The
libxml2 parser uses leveled buffers that copy the data during decoding.
That may already be a sufficient barrier against such problems.

What about posting a self-contained Python module, stripped down to the
minimum, that shows the unexpected behaviour? It should not access the
internet or anything like that; just embed a sufficient part of a failing
web page as a string (possibly base64 encoded). That way, others could try
to reproduce the problem on their side and debug it.
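
A reproduction module along these lines might look like the sketch below.
It is only an illustration: the payload is a made-up fragment containing a
stray 0xFF byte, not data from the failing page, and the names PAGE_B64,
load_page, Collector and reproduce are hypothetical. It is written for
Python 3 and assumes lxml is installed when reproduce() is called.

```python
import base64

# Hypothetical stand-in for a fragment of the failing page:
# b"<p>caf\xff</p>", where the byte 0xFF is not valid UTF-8.
# Replace it with a base64 dump of the real page data.
PAGE_B64 = "PHA+Y2Fm/zwvcD4="


def load_page():
    """Return the embedded page fragment as raw bytes."""
    return base64.b64decode(PAGE_B64)


class Collector:
    """Parser target that records every text chunk passed to data()."""

    def __init__(self):
        self.chunks = []

    def start(self, tag, attrib):
        pass

    def end(self, tag):
        pass

    def data(self, text):
        self.chunks.append(text)

    def close(self):
        return "".join(self.chunks)


def reproduce():
    """Feed the raw bytes to a recovering HTML parser with a target."""
    # Imported here so the payload helpers work even without lxml.
    from lxml import etree

    parser = etree.HTMLParser(target=Collector(), recover=True)
    # With a target parser, the result is whatever Collector.close() returns.
    return etree.fromstring(load_page(), parser)
```

Calling reproduce() should then either return the collected text or raise
the error in question, which gives others something concrete to debug.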
>> The second level is where lxml comes into play. When you define a
>> "data()" method on your target parser, you ask lxml to pass you the
>> character data from the document. lxml's SAX handler will then try to
>> decode the UTF-8 data provided by the libxml2 parser to pass it into your
>> method. If the data returned by the parser is not valid UTF-8, this will
>> fail. I assume that this is where the exception that you see originates
>> from, as this is done through the Python Codec API.
>
> Yes, that is the case. But the illegal character came from outside of
> lxml and outside of libxml2. The whole string was fetched from a URL
> using the urllib module in Python. So I wonder whether there is some other
> method to get HTML content from a URL without illegal characters.
Well, as I said before: if the HTML is broken, there is no way to make sure
the parser can read all data 'correctly' (whatever that means in this
context). If the web page adheres to an encoding and just fails to declare
it correctly, your best bet is to decode the page into a unicode string
yourself, catch and handle any decoding errors in a suitable way, and pass
that unicode string into the parser.
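
As a rough sketch of that approach (Python 3, where the "unicode string" is
a plain str): the helper below tries a list of candidate encodings and
falls back to replacing undecodable bytes. The function name decode_page
and the candidate list are assumptions made for illustration; the right
encodings depend on the page in question.

```python
def decode_page(raw, candidates=("utf-8", "cp1252")):
    """Decode raw page bytes, trying each candidate encoding in turn.

    Returns (text, encoding_used). If no candidate decodes cleanly, the
    first candidate is used and undecodable bytes become U+FFFD.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: keep what can be read, mark the rest.
    fallback = candidates[0]
    return raw.decode(fallback, errors="replace"), fallback
```

The returned text can then be handed to the parser instead of the raw
bytes, for example via lxml.html.fromstring(text). Replacing bad bytes is
only one way of handling the error; logging or rejecting the page may suit
other cases better.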
Stefan