[lxml-dev] Encoding again

Stefan Behnel stefan_ml at behnel.de
Tue Aug 26 11:46:46 CEST 2008


Max Ivanov wrote:
>> What you pass is a byte stream of unknown encoding. What you get back is
>> a tree with well defined characters. Isn't that great enough?
> In some cases (original text in ASCII) there are well defined
> characters, in other cases it is garbage. Why couldn't you just leave
> the content inside tags as is when the original encoding is unknown and
> the parser is unable to detect it from the data (no <meta> tag, for
> example)?
>
> I'm just asking about a new keyword argument which disables any
> processing of unknown byte streams inside tags. That would make lxml
> more useful in a wider range of situations.

If you provide a patch, we can discuss it. But be warned that it's not
just about adding a new keyword argument.


> If nobody knows what the content
> actually is, then leave it as is, as the original byte stream.

Parsing XML/HTML is about converting a byte stream to a tree (ok, usually)
and character content (always).


> Why does lxml now suggest that the input stream is Unicode?

It does not suggest anything like that. It expects the input to be a byte
stream and returns a tree with Unicode content.


> Any unknown data should be left untouched! That's the simplest
> rule I could ever imagine: if you don't know what it is, and you
> don't need it for your task, then avoid any processing of that data.
> It's up to the user how to handle it later.

Then how would the parser know that what it currently sees is a '<', for
example, if it didn't parse with a specific encoding in mind? Only
decoding the byte stream can enable the parser to know what is markup and
what isn't, so that the user can handle the non-markup content later, as
you suggest.
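
A minimal sketch of this point, using only Python's built-in codecs. The
sample strings are made up for illustration and are not taken from the
thread:

```python
# The same markup serialised under two encodings yields different bytes.
markup = "<p>"
as_ascii = markup.encode("ascii")       # one byte per character
as_utf16 = markup.encode("utf-16-le")   # two bytes per character

# The reverse also holds: the same bytes can be markup or text.
ambiguous = b"<<"
read_as_ascii = ambiguous.decode("ascii")       # two '<' characters
read_as_utf16 = ambiguous.decode("utf-16-le")   # one character, U+3C3C

print(as_ascii, as_utf16)
print(len(read_as_ascii), len(read_as_utf16))
```

Whether the byte 0x3C starts a tag therefore depends on the encoding the
parser assumes while reading.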


>> I showed you two ways to make it the right sequence of characters in my
>> last post, in case you have enough information to figure out the encoding
>> with your own code.
> All of them require finding out the encoding of the text before parsing
> it with lxml.

This is correct. lxml's primary purpose is not detecting encodings but
parsing XML/HTML. There are other tools that are better at detecting
character encodings because they were specifically written for that
purpose. Use the right tool for the job.


>>> # ok, converting original text to unicode to compare
>>> unidata = origdata.decode('original encoding')
>>> unidata == doc.text_content()  # FALSE! lxml makes garbage from our text.
>>
>> No, it doesn't. It makes well-defined characters from ambiguous bytes.
>> Please try to understand the difference between an encoded byte sequence
>> and a Unicode character sequence before you blame tools that deploy
>> Unicode correctly.
> Why do you call them well-defined characters?

Because they are characters that are defined by the Unicode standard.
Bytes are not characters, although some people mix up the two because they
have only ever worked with byte encodings.
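
To illustrate with a small sketch (the sample word and the choice of
codecs are assumptions picked for the example): one byte sequence decodes
to three different, equally well-defined Unicode strings, and only the
bytes' real encoding gives back the intended text.

```python
text = "Привет"               # hypothetical sample text
raw = text.encode("cp1251")   # the bytes as they might arrive on the wire

# Each decode succeeds and returns valid Unicode characters.
as_cp1251 = raw.decode("cp1251")
as_koi8r = raw.decode("koi8-r")
as_latin1 = raw.decode("latin-1")

print(as_cp1251 == text)   # True: the right encoding restores the text
print(as_koi8r == text)    # False: valid characters, but the wrong ones
print(as_latin1 == text)   # False: same here
```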


> chardet, which I use, is implemented in Python, so I can't feed it
> a large amount of data because that takes a lot of time. If I
> feed it just 1-2 kB of raw source, I can't guarantee that there
> will be enough national characters for proper detection.

Then try to find a representative set of small samples, instead of taking
the first few bytes.
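
One possible way to do that is sketched below. The helper name
`collect_samples`, the gap size of 8 bytes and the 2048-byte budget are
all assumptions made for this example. It gathers runs of non-ASCII bytes
from anywhere in the document, so that a detector such as chardet sees the
informative parts rather than an ASCII-only header:

```python
import re

# Runs of non-ASCII bytes, tolerating short ASCII gaps (spaces, punctuation).
_HIGH_RUN = re.compile(rb"[\x80-\xff]+(?:[\x00-\x7f]{1,8}[\x80-\xff]+)*")

def collect_samples(data, max_bytes=2048):
    """Return up to roughly max_bytes of non-ASCII runs found in data."""
    parts = []
    total = 0
    for match in _HIGH_RUN.finditer(data):
        if total >= max_bytes:
            break
        # The last run may be cut short, which can split a multi-byte sequence.
        chunk = match.group()[: max_bytes - total]
        parts.append(chunk)
        total += len(chunk)
    return b" ".join(parts)

# Hypothetical document: a long ASCII prefix, then a little national text.
body = "Привет мир".encode("cp1251")
data = b"<html>" + b"a" * 5000 + body + b"</html>"

sample = collect_samples(data)
print(len(data), len(sample))
```

The resulting `sample` is what would be handed to the detector. Taking the
first 2 kB of `data` instead would give it nothing but ASCII.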

Stefan


