[lxml-dev] Encoding again

Stefan Behnel stefan_ml at behnel.de
Tue Aug 26 08:58:50 CEST 2008


Hi,

Max Ivanov wrote:
>> I can't test it right now, but this might work for you. I just provided
>> the parser with the right encoding information. Note that your "HTML
>> document" does not specify an encoding, so I assume that the parser just
>> expects it to be latin-1 or some other plain byte encoding, and reads the
>> bytes as they come in. To be clear: it's the document that's broken here,
>> not the parser.
> 
> Yes indeed. I understand that the document is broken, but that's the case
> - I have to process even broken HTML pages. What's more, lxml does a lot
> of heavy lifting to make processing broken HTML much easier. I'm
> talking about another step in that direction. There are a lot of pages in
> the Russian segment of the internet with no charset specified. All of them
> contain lots of symbols with codes > 128. Do you agree that if you
> pass some data in, it is reasonable to expect to get exactly the same
> data back?

What you pass is a byte stream of unknown encoding. What you get back is a
tree with well defined characters. Isn't that great enough?


> Nowadays we have:
> 
> origdata = 'some string with codes >128 (national chars)'
> xml = '<root>'+origdata+'</root>'
> .... parsing it with lxml....
> rettext = doc.text_content()
> isinstance(rettext, unicode) #TRUE! but original text was not unicode.

The "text" you are talking about was a sequence of bytes. Now it is a sequence
of characters. It may not be the sequence you expect, because the document
does not provide any hints about which characters its byte sequences describe
(how do /you/ know it's really Bulgarian characters?), so they may be Latin-1,
they may be UTF-8, they may be a Cyrillic encoding, they may be EBCDIC.

I showed you two ways to make it the right sequence of characters in my last
post, in case you have enough information to figure out the encoding with your
own code.


> #ok, converting original text to unicode to compare
> unidata = origdata.decode('original encoding')
> unidata == doc.text_content() #FALSE! lxml makes garbage from our text.

No, it doesn't. It makes well-defined characters from ambiguous bytes. Please
try to understand the difference between an encoded byte sequence and a
Unicode character sequence before you blame tools that deploy Unicode correctly.
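
To illustrate the distinction, here is a minimal Python 3 sketch (the sample
word and the three candidate encodings are assumptions chosen for the
example, not taken from the pages under discussion). The same byte sequence
yields a different, equally well-defined character sequence under each
encoding, and nothing in the bytes says which one was meant.

```python
# One byte sequence, as it might arrive from a page with no charset declared.
raw = "Привет".encode("cp1251")

# Each decode succeeds and yields well-defined characters, but only one
# of them matches what the author of the page had in mind.
as_cp1251 = raw.decode("cp1251")      # 'Привет'
as_latin1 = raw.decode("iso-8859-1")  # 'Ïðèâåò'
as_koi8r = raw.decode("koi8-r")       # Cyrillic letters, but the wrong ones

for name, text in [("cp1251", as_cp1251),
                   ("iso-8859-1", as_latin1),
                   ("koi8-r", as_koi8r)]:
    print(name, repr(text))
```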


> OK, in some cases we could do rettext.encode('iso-8859-1'), which
> converts the unicode string to a single-byte string, leaving the bytes
> the same (i.e. the unicode string is read as a byte array).

That's a pretty ugly hack, I hope you know that.


> But imagine what would happen if the original data contains an "&nbsp;"
> entity. In rettext there will be one correct unicode symbol, and when
> we try to convert it to a single-byte string with the iso-8859-1 hack, it
> will be converted to the wrong symbol!

Ah, so you already know that it's an ugly hack. Fine. :)
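
A rough Python 3 sketch of why the round trip breaks, under two assumptions:
the parser fell back to Latin-1 for the undeclared bytes, and html.unescape
from the standard library stands in for the parser's entity resolution. The
sample text and the KOI8-R encoding are made up for the example.

```python
import html

# What the page author meant: two words separated by a no-break space.
intended = "Привет\u00a0мир"

# What is actually on the wire: KOI8-R bytes plus an ASCII entity reference.
raw = "Привет&nbsp;мир".encode("koi8-r")

# Assumed parser behaviour: read the bytes as Latin-1, then resolve entities.
parsed = html.unescape(raw.decode("iso-8859-1"))

# The hack: turn the characters back into bytes and decode them "properly".
recovered = parsed.encode("iso-8859-1").decode("koi8-r")

# The entity became U+00A0, i.e. byte 0xA0, which is not a no-break space
# in KOI8-R, so the result no longer matches the intended text.
print(recovered == intended)

# Entities outside Latin-1 are worse: they cannot be encoded at all.
try:
    html.unescape("&mdash;").encode("iso-8859-1")
    mdash_survives = True
except UnicodeEncodeError:
    mdash_survives = False
```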


>> Note that you can also pass unicode strings into the parser, so if you
>> manage to decode your HTML data into correct unicode, the parser will do
>> the right thing.
> That's what I'm trying to do. But first I need to throw out all tags
> and leave only the tag content, because the charset detector would get
> confused if there are lots of ASCII symbols and few national symbols.

Then it's not a good-enough encoding detector. You really shouldn't blame the
encoding detector in libxml2 for not being able to detect an ambiguous
encoding, if the tool you prefer fails in the same way.

If you want to remove all tags from the input byte sequence just to detect its
encoding, you can use a regular expression like  b"<[^>]*>".  Should be good
enough for that purpose.
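
As a minimal Python 3 sketch of that suggestion: strip_tags and the sample
page are hypothetical names and data invented for the example, and replacing
each tag with a space (rather than nothing) is a choice made here so that
adjacent words do not run together. It assumes an ASCII-compatible encoding,
so that "<" and ">" are single bytes.

```python
import re

# The pattern from above, compiled as a bytes regex so that it works on
# the raw input before any decoding has happened.
TAG_RE = re.compile(b"<[^>]*>")

def strip_tags(data: bytes) -> bytes:
    """Replace everything that looks like a tag with a single space."""
    return TAG_RE.sub(b" ", data)

# Hypothetical page in an undeclared single-byte encoding.
page = (b"<html><body><p>"
        + "Привет, мир".encode("cp1251")
        + b"</p></body></html>")

# Only the text bytes remain; these are what you would hand to whatever
# encoding detector you use.
text_bytes = strip_tags(page).strip()
print(text_bytes)
```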

Stefan



