[lxml-dev] Encoding again

Max Ivanov ivanov.maxim at gmail.com
Mon Aug 25 22:15:33 CEST 2008


> I can't test it right now, but this might work for you. I just provided
> the parser with the right encoding information. Note that your "HTML
> document" does not specify an encoding, so I assume that the parser just
> expects it to be latin-1 or some other plain byte encoding, and reads the
> bytes as they come in. To be clear: it's the document that's broken here,
> not the parser.

Yes indeed. I understand that the document is broken, but that is exactly
the case - I have to process even broken HTML pages. What's more, lxml
already does a lot of heavy lifting to make processing broken HTML much
easier, and I'm talking about one more step in that direction. There are
a lot of pages in the Russian segment of the internet with no charset
specified, and all of them contain lots of characters with codes > 127.
Do you agree that if you pass some data in, it is reasonable to expect
to get exactly the same data back? Nowadays we have:

origdata = 'some string with codes >127 (national chars)'
xml = '<root>' + origdata + '</root>'
# .... parsing it with lxml ....
rettext = doc.text_content()
isinstance(rettext, unicode)  # True! But the original text was not unicode.
# OK, converting the original text to unicode to compare
unidata = origdata.decode('original encoding')
unidata == rettext  # False! lxml makes garbage from our text.

XML is all about tags and attributes, so why does lxml alter the content
of elements? It should leave the content as is if it doesn't know what to
do with it (== there is no charset information, so it is unable to detect
the encoding).

OK, in some cases we could do rettext.encode('iso-8859-1'), which
converts the unicode string to a single-byte string leaving the bytes the
same (== the unicode string is read back as a byte array).
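
To make the hack concrete, here is a minimal sketch of the round trip. It is written for Python 3, where str plays the role of unicode, and it assumes a cp1251 page purely for illustration; the sample string is made up.

```python
# Bytes as they might arrive from a page with no charset declared
# (cp1251 is an assumption made for this example).
original = "Привет, мир".encode("cp1251")

# A parser that falls back to latin-1 maps every byte to the code point
# of the same value, so the text looks like garbage.
misread = original.decode("iso-8859-1")

# The hack: encoding back to latin-1 restores the original bytes exactly.
recovered = misread.encode("iso-8859-1")

# Only now can the bytes be decoded with the real encoding.
text = recovered.decode("cp1251")
```

This only works as long as every character in the parsed text came from a single input byte.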

But imagine what would happen if the original data contains an "&nbsp;"
entity. In rettext it becomes one correct unicode character (U+00A0), and
when we try to convert it to a single-byte string with the iso-8859-1
hack, it becomes byte 0xA0, which can be the wrong character in the
page's real encoding!
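
A small sketch of that failure, again in Python 3. It assumes a KOI8-R page, and it uses the standard library's html.unescape as a stand-in for the parser's entity handling, so it illustrates the effect rather than what lxml does internally.

```python
import html

# Bytes of a KOI8-R page fragment containing an entity
# (the encoding and the sample text are assumptions).
original = "цена&nbsp;100".encode("koi8-r")

# With no charset information the bytes are read as latin-1 ...
misread = original.decode("iso-8859-1")
# ... and the entity is resolved to the real no-break space U+00A0.
resolved = html.unescape(misread)

# The hack turns U+00A0 into byte 0xA0, which KOI8-R does not use for a
# no-break space, so the recovered text differs from the intended one.
hacked = resolved.encode("iso-8859-1")
recovered = hacked.decode("koi8-r")
intended = "цена\u00a0100"
mismatch = recovered != intended

# Entities outside latin-1 cannot be encoded by the hack at all.
try:
    html.unescape("a&mdash;b").encode("iso-8859-1")
    mdash_fails = False
except UnicodeEncodeError:
    mdash_fails = True
```

So the hack is safe only for text in which no entity or character reference was resolved.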

> Note that you can also pass unicode strings into the parser, so if you
> manage to decode your HTML data into correct unicode, the parser will do
> the right thing.
That's what I'm trying to do. But first I need to throw out all the tags
and leave only the tag content, because the charset detector would get
confused if there are lots of ASCII characters and only a few national
characters. doc.text_content() would be an ideal way to do that, but
right now it is unusable for this task.
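
One possible workaround, sketched below in Python 3, is to strip the markup at the byte level before any decoding happens, so the detector sees the content bytes untouched. The regular expressions and the text_bytes helper are hypothetical and deliberately crude (no handling of comments, scripts, or '>' inside attribute values), and they assume an ASCII-compatible encoding.

```python
import re

TAG_RE = re.compile(rb"<[^>]*>")      # crude tag matcher, not a real HTML parser
ENTITY_RE = re.compile(rb"&#?\w+;")   # named and numeric references (ASCII only)

def text_bytes(raw: bytes) -> bytes:
    """Replace tags and entity references with spaces; leave other bytes as is."""
    without_tags = TAG_RE.sub(b" ", raw)
    without_entities = ENTITY_RE.sub(b" ", without_tags)
    # Collapse runs of ASCII whitespace so the sample is mostly content bytes.
    return b" ".join(without_entities.split())

# Sample page in cp1251 (the encoding and the text are assumptions).
raw = "<html><body><p>Привет,&nbsp;мир</p></body></html>".encode("cp1251")
sample = text_bytes(raw)
```

The resulting sample could then be handed to whatever charset detector is in use, and once the encoding is known the whole page can be decoded to unicode and passed to the parser as suggested above.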

