[lxml-dev] Encoding again
Max Ivanov
ivanov.maxim at gmail.com
Tue Aug 26 11:09:15 CEST 2008
> What you pass is a byte stream of unknown encoding. What you get back is a
> tree with well defined characters. Isn't that great enough?
In some cases (original text in ASCII) they are well-defined
characters; in other cases the result is garbage. Why couldn't you just
leave the content inside tags as is when the original encoding is
unknown and the parser is unable to detect it from the data (no <meta>
tag, for example)?
I'm only asking for a new keyword argument that disables any
processing of unknown byte streams inside tags. That would make lxml
more useful in a wider range of situations.
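
To make the ASCII versus non-ASCII point concrete, here is a minimal sketch using only the Python standard library. It uses Python 3 syntax, whereas the snippets quoted in this thread are Python 2, and the sample word and the list of candidate encodings are assumptions chosen for illustration. It shows that the same bytes decode to a different string under each candidate encoding, while pure ASCII bytes decode identically under every ASCII-compatible one.

```python
# A Cyrillic word encoded as windows-1251 (an assumed example encoding).
raw = "\u0417\u0434\u0440\u0430\u0432\u0435\u0439".encode("cp1251")

# Candidate encodings a parser could guess; cp037 is an EBCDIC code page.
candidates = ["cp1251", "latin-1", "koi8-r", "cp037"]
decoded = {enc: raw.decode(enc) for enc in candidates}

# The same bytes are not even valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Pure ASCII bytes give the same text in every ASCII-compatible encoding.
ascii_raw = b"plain ascii text"
ascii_decoded = {
    enc: ascii_raw.decode(enc)
    for enc in ["cp1251", "latin-1", "koi8-r", "utf-8"]
}

for enc, text in decoded.items():
    print(enc, repr(text))
```

Every decode in `decoded` succeeds without an error, so nothing in the bytes themselves says which result is the intended one.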
>> origdata = 'some string with codes >128 (national chars)'
>> xml = '<root>'+origdata+'</root>'
>> .... parsing it with lxml....
>> rettext = doc.text_content()
>> isinstance(rettext, unicode) #TRUE! but original text was not unicode.
>
> The "text" you are talking about was a sequence of bytes. Now it is a sequence
> of characters. It may not be the sequence you expect, because the document
> does not provide any hints about what the characters it describes with its
> byte sequences are (how do /you/ know it's really bulgarian characters?), so
> they may be Latin-1, they may be UTF-8, they may be Cyrillic, they may be EBCDIC.
That's what I'm talking about! If nobody knows what the content
actually is, then leave it as is, as the original byte stream. Why does
lxml assume that the input stream is Unicode? Nobody told it that. If
lxml doesn't know the encoding, it should just process tags and
attributes and build the tree. lxml doesn't need to know the correct
encoding to do that, so any unknown data should be left untouched.
That's the simplest rule I can imagine: if you don't know what
something is and you don't need it for your task, avoid processing
that data at all. It's up to the user how to handle it later.
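
For what it's worth, a similar effect can be approximated without a new parser option. The sketch below is not something proposed in this thread: it relies on the fact that Latin-1 maps every byte to the code point with the same value, so decoding as Latin-1 and encoding back is lossless. It uses Python 3 and the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption made to keep the example self-contained), and the sample word and its cp1251 encoding are likewise illustrative.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml (assumption)

# Bytes whose real encoding is "unknown" to the parser (cp1251 here).
origdata = "\u0417\u0434\u0440\u0430\u0432\u0435\u0439".encode("cp1251")
xml_bytes = b"<root>" + origdata + b"</root>"

# Latin-1 maps each byte 0x00-0xFF to the code point of the same value,
# so this decode never fails and never loses information.
placeholder = xml_bytes.decode("latin-1")
root = ET.fromstring(placeholder)

# The text is "wrong" as characters, but encoding it back to Latin-1
# returns exactly the bytes that were in the document.
recovered = root.text.encode("latin-1")

# Later, once the real encoding is known, decode the recovered bytes.
text = recovered.decode("cp1251")
```

This only holds for content the parser passes through unchanged. An XML parser normalizes line endings and rejects most control bytes, and it is an open assumption that a given parser treats its Latin-1 input as a strict byte-to-code-point mapping.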
> I showed you two ways to make it the right sequence of characters in my last
> post, in case you have enough information to figure out the encoding with your
> own code.
All of them require finding out the encoding of the text before parsing
it with lxml. I'll say more about that later in this message.
>
>> #ok, converting original text to unicode to compare
>> unidata = origdata.decode('original encoding')
>> unidata == doc.text_content() #FALSE! lxml makes garbage from our text.
>
> No, it doesn't. It makes well-defined characters from ambiguous bytes. Please
> try to understand the difference between an encoded byte sequence and a
> Unicode character sequence before you blame tools that deploy Unicode correctly.
Why do you call them well-defined characters? What is so "well" about them?
> Then it's not a good-enough encoding detector. You really shouldn't blame the
> encoding detector in libxml2 for not being able to detect an ambiguous
> encoding, if the tool you prefer fails in the same way.
I've found out from various experiments that the libxml2 encoding
detector is based on tags in the source (meta, "<? xml encoding.... ?>",
etc.). chardet, which I use, is implemented in Python, so I can't feed
it a large amount of data because that takes a lot of time. If I feed
it just 1-2 KB of raw source, I can't guarantee that there will be
enough national characters for proper detection. That's why I clean
the source first, take a small chunk of the cleaned data for
detection, and then feed the parser.
> If you want to remove all tags from the input byte sequence just to detect its
> encoding, you can use a regular expression like b"<[^>]*>". Should be good
> enough for that purpose.
Removing <script> tag content, comments, and then tags with your regexp
takes 20% of the total processing time. I don't like the idea of
spending 20% of the time on what is effectively a no-op when lxml
already has almost everything needed for that task.
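
The cleaning step described above can be sketched roughly as follows, in Python 3 with only the standard library. The function name `sample_for_detection`, the 2048-byte limit, and the script and comment patterns are hypothetical choices for illustration; only the tag pattern b"<[^>]*>" comes from the quoted suggestion. The resulting sample is what would then be handed to a detector such as chardet.

```python
import re

# Patterns operate on raw bytes, because the encoding is not known yet.
SCRIPT_RE = re.compile(rb"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)
COMMENT_RE = re.compile(rb"<!--.*?-->", re.DOTALL)
TAG_RE = re.compile(rb"<[^>]*>")  # the pattern suggested in the quoted reply


def sample_for_detection(raw, limit=2048):
    """Strip scripts, comments and tags, then return at most `limit` bytes."""
    text = SCRIPT_RE.sub(b" ", raw)    # drop script bodies first
    text = COMMENT_RE.sub(b" ", text)  # then comments
    text = TAG_RE.sub(b" ", text)      # then all remaining tags
    text = b" ".join(text.split())     # collapse runs of ASCII whitespace
    return text[:limit]


page = (
    b"<html><head><script>var a = '<b>';</script><!-- note --></head>"
    b"<body><p>\xc7\xe4\xf0</p> text</body></html>"
)
sample = sample_for_detection(page)
print(sample)
```

Cutting at a fixed byte count can split a multi-byte character at the end of the sample, so a detector may see one damaged trailing character. The three regex passes are exactly the extra work that the 20% figure above refers to.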