[lxml-dev] non-ascii characters get garbled

js ebgssth at gmail.com
Tue Sep 18 15:56:21 CEST 2007


Hello again.

I downgraded libxml2 from 2.6.29_0 to 2.6.27_0
and re-ran the test script.
Surprise: now it all works as described in the lxml doc!
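
For anyone who wants to check this without trusting the console, here is a
minimal sketch. It compares the parsed text against known code points instead
of looking at printed output. The Korean string is the one from Stefan's
output quoted below; build_page and looks_garbled are names I made up for
illustration, and I am assuming a UTF-8 page with a <meta> charset
declaration, which may not match the real page.

```python
# -*- coding: utf-8 -*-
# Sketch only: helper names are hypothetical, and the UTF-8 page is an assumption.

# Code points taken from the output quoted later in this thread.
EXPECTED = u"\uc560\ud50c\ucef4\ud4e8\ud130\ucf54\ub9ac\uc544"


def build_page(text, charset="utf-8"):
    """Return an HTML byte string that declares its charset in a <meta> tag."""
    html = (u'<html><head><meta http-equiv="Content-Type" '
            u'content="text/html; charset=%s"></head>'
            u'<body><p>%s</p></body></html>') % (charset, text)
    return html.encode(charset)


def looks_garbled(actual, expected=EXPECTED):
    """True if the parsed text differs from the code points we put in."""
    return actual != expected


page = build_page(EXPECTED)

# What a wrong charset conversion typically looks like:
# UTF-8 bytes misread as Latin-1.
mojibake = EXPECTED.encode("utf-8").decode("latin-1")

try:
    import lxml.html
except ImportError:
    lxml = None  # lxml not installed; only the helpers above are usable
else:
    # Pass the raw bytes to the parser and let libxml2 detect the encoding.
    parsed = lxml.html.fromstring(page).text_content()
    print(repr(parsed))  # repr() does not depend on the console encoding
    print("garbled" if looks_garbled(parsed) else "ok")
```

If this reports "garbled" while repr() shows Latin-1 style code points, the
problem is in the charset conversion and not in the terminal.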

It seems newer libxml2 versions have a problem with charset conversion.
(2.6.28_1 doesn't work either.)
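
In case it helps with reproducing this, a small sketch for reporting which
libxml2 lxml is actually running against. LXML_VERSION, LIBXML_VERSION and
LIBXML_COMPILED_VERSION are attributes of lxml.etree; format_version is just
a helper name I picked, and the import guard is only there so the snippet
stays quiet on a machine without lxml.

```python
def format_version(version_tuple):
    """Join a version tuple such as (2, 6, 27) into '2.6.27'."""
    return ".".join(str(part) for part in version_tuple)


try:
    from lxml import etree
except ImportError:
    etree = None  # lxml not installed; nothing to report

if etree is not None:
    # Version of lxml itself.
    print("lxml:             " + format_version(etree.LXML_VERSION))
    # libxml2 version lxml was built against.
    print("libxml2 compiled: " + format_version(etree.LIBXML_COMPILED_VERSION))
    # libxml2 version loaded at runtime (the one that does the conversion).
    print("libxml2 runtime:  " + format_version(etree.LIBXML_VERSION))
```

The runtime line is the one that matters here, since a downgrade of the
shared library changes it without rebuilding lxml.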

I'll look at libxml2's source.

Thank you.


On 9/17/07, Stefan Behnel <stefan_ml at behnel.de> wrote:
>
> js wrote:
> > The lxml doc [*1] says that
> > "You should generally avoid converting XML/HTML data to unicode before
> > passing it into the parsers. It is both slower and error prone."
> >
> > [*1] http://codespeak.net/lxml/parsing.html#python-unicode-strings
> >
> > but my experience is different from that.
>
> Not quite. As you say below, you sometimes get ValueErrors depending on the
> page data, so it *is* error prone.
>
>
> > For example, the following code doesn't bother with encoding
> > and leaves that work to lxml.etree.
> > According to the doc, this is the right way, but it doesn't work
> > and you'll get garbled characters. (Give it a try.)
> >
> > --------------------------------------------------------------------
> > # -*- coding: utf-8 -*-
> > from lxml import html as etree
>
> This import makes your code hard to read IMHO. If you use lxml.html, say it.
>
>
> > url='http://apple.com/kr'
> >
> > tree = etree.parse(url)
> > from pprint import pformat
> > for t in tree.xpath('//a[text()]'):
> >     print t.text_content()
> > --------------------------------------------------------------------
>
> Hmm, when I do that, it prints beautiful Asian (Korean?) letters for me. And
> when I collect the text, it looks perfectly reasonable, including strings like
>
>   u'Copyright \xa9 2007 \uc560\ud50c\ucef4\ud4e8\ud130\ucf54\ub9ac\uc544.
> \ubaa8\ub4e0 \uad8c\ub9ac \ubcf4\uc720 \xa0 '
>
> This is on lxml 2.0alpha2 and libxml2 2.6.27, on a Linux UTF-8 console.
>
> Are you sure it's the text content and not just the console output on your side?
>
>
> > Sometimes I get "ValueError: Unicode strings with encoding declaration
> > are not supported."
>
> On the same page? I assume you were referring to a different page here that
> probably uses XHTML instead of HTML, right? The above should work for both -
> as long as libxml2 can detect the encoding (and if it can't, there's
> lxml.html.ElementSoup to the rescue, if you have BeautifulSoup installed).
>
> Stefan
>

