[lxml-dev] non-ascii characters get garbled

js ebgssth at gmail.com
Mon Sep 17 14:02:52 CEST 2007


Hello, list.

The lxml doc [*1] says that
"You should generally avoid converting XML/HTML data to unicode before
passing it into the parsers. It is both slower and error prone."

[*1] http://codespeak.net/lxml/parsing.html#python-unicode-strings

but my experience is different from that.

For example, the following code does no encoding work of its own
and leaves it all to lxml.
According to the doc, this is the right way, but it doesn't work:
the non-ASCII characters come out garbled. (Give it a try.)

--------------------------------------------------------------------
# -*- coding: utf-8 -*-
# Let lxml fetch the page and work out the encoding by itself.
from lxml import html as etree

url = 'http://apple.com/kr'

tree = etree.parse(url)
# Print the text of every link that has a text node.
for t in tree.xpath('//a[text()]'):
    print t.text_content()
--------------------------------------------------------------------

The next one breaks the rule and does all the charset conversion itself.
This one works well: the non-ASCII characters come out correctly.

--------------------------------------------------------------------
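# -*- coding: utf-8 -*-
# Fetch the page by hand, decode it to unicode, then give it to lxml.
from lxml import html as etree
from urllib2 import urlopen
from StringIO import StringIO

url = 'http://apple.com/kr'

res = urlopen(url)
# Use the charset from the Content-Type header
# (getparam() returns None if the header does not name one).
html = res.read().decode(res.headers.getparam('charset'))
tree = etree.parse(StringIO(html))
# Print the text of every link that has a text node.
for t in tree.xpath('//a[text()]'):
    print t.text_content()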
--------------------------------------------------------------------
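
The decoding step in the second script trusts the Content-Type header to
name a charset. A minimal sketch of that step with a fallback, using only
the standard library. The names pick_charset and decode_body and the
utf-8 default are my own assumptions, not anything lxml provides.

```python
from email.message import Message


def pick_charset(content_type, default="utf-8"):
    """Return the charset named in a Content-Type value, or `default`."""
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default


def decode_body(raw, content_type, default="utf-8"):
    """Decode raw response bytes, replacing bytes that cannot be decoded."""
    return raw.decode(pick_charset(content_type, default), "replace")
```

In the second script this would take the place of the .decode(...) call,
for example decode_body(res.read(), res.headers.get('Content-Type')).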

But the second script doesn't always work either.
Sometimes I get "ValueError: Unicode strings with encoding declaration
are not supported."
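
My guess is that the message refers to an XML declaration with an encoding
attribute at the very start of the decoded text. A minimal sketch to check a
page for that, assuming the guess is right. The helper name and the regular
expression are mine and only approximate whatever lxml really tests.

```python
import re

# Matches an XML declaration at the start of the text that carries an
# encoding attribute, e.g. <?xml version="1.0" encoding="utf-8"?>
_XML_DECL_WITH_ENCODING = re.compile(
    r'<\?xml[^>]*\sencoding\s*=\s*["\'][^"\']*["\']'
)


def has_encoding_declaration(text):
    """Return True if `text` begins with an XML declaration naming an encoding."""
    return _XML_DECL_WITH_ENCODING.match(text) is not None
```

If this returns True for the pages that fail and False for the ones that
parse, that would at least show where the error comes from.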

Is this a known issue? If so, how can I get around it?
Are there any workarounds?

I tried to figure out the cause and looked through the lxml and
libxml2 source code, but could not find a clue.
(To me this looks like a libxml2 problem rather than an lxml one, though.)

Any information would be greatly appreciated.
Thank you in advance.

