[lxml-dev] non-ascii characters get garbled
Stefan Behnel
stefan_ml at behnel.de
Mon Sep 17 14:41:25 CEST 2007
js wrote:
> The lxml doc [*1] says that
> "You should generally avoid converting XML/HTML data to unicode before
> passing it into the parsers. It is both slower and error prone."
>
> [*1] http://codespeak.net/lxml/parsing.html#python-unicode-strings
>
> but my experience is different from that.
Not quite. As you say below, you sometimes get ValueErrors depending on the
page data, so it *is* error prone.
> For example, the following code doesn't bother encoding things
> and leaves the work to lxml.etree.
> According to the doc, this is the right way, but it doesn't work
> and you'll get garbled characters. (give it a try)
>
> --------------------------------------------------------------------
> # -*- coding: utf-8 -*-
> from lxml import html as etree
This import makes your code hard to read IMHO. If you use lxml.html, say it.
> url='http://apple.com/kr'
>
> tree = etree.parse(url)
> from pprint import pformat
> for t in tree.xpath('//a[text()]'):
>     print(t.text_content())
> --------------------------------------------------------------------
Hmm, when I do that, it prints beautiful Asian (Korean?) letters for me. And
when I collect the text, it looks perfectly reasonable, including strings like
    u'Copyright \xa9 2007 \uc560\ud50c\ucef4\ud4e8\ud130\ucf54\ub9ac\uc544. \ubaa8\ub4e0 \uad8c\ub9ac \ubcf4\uc720 \xa0 '
This is on lxml 2.0alpha2 and libxml2 2.6.27, on a Linux UTF-8 console.
Are you sure it's the text content and not just the console output on your side?
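One way to tell the two cases apart is to look at an escaped form of the string instead of what the terminal renders. A minimal sketch using only the standard library; `describe` and `console_can_show` are names made up for this illustration, not part of lxml:

```python
import sys


def describe(text):
    """Return an ASCII-only, escaped view of a unicode string."""
    return text.encode("unicode_escape").decode("ascii")


def console_can_show(text, encoding=None):
    """Check whether `text` is representable in the given (or console) encoding."""
    encoding = encoding or sys.stdout.encoding or "ascii"
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


sample = u"Copyright \xa9 2007 \uc560\ud50c"
# The escaped form shows the real code points, whatever the terminal does.
print(describe(sample))
# If this prints False, garbled output is a console problem, not a parser problem.
print(console_can_show(sample))
```

If the escaped form shows the expected code points but the plain print looks wrong, the data is fine and only the console encoding is at fault.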
> Sometimes I got "ValueError: Unicode strings with encoding declaration
> are not supported. "
On the same page? I assume you were referring to a different page here that
probably uses XHTML instead of HTML, right? The above should work for both -
as long as libxml2 can detect the encoding (and if it can't, there's
lxml.html.ElementSoup to the rescue, if you have BeautifulSoup installed).
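To illustrate the difference, here is a small sketch, assuming a reasonably recent lxml. The sample document and the helper names are made up for this example. Byte strings with an encoding declaration parse fine, while the same data decoded to unicode triggers the ValueError quoted above:

```python
from lxml import etree

# Hypothetical sample document: UTF-8 bytes that carry an encoding declaration.
XML_BYTES = u'<?xml version="1.0" encoding="UTF-8"?><p>\uc560\ud50c</p>'.encode("utf-8")


def parse_bytes(data):
    """Hand raw bytes to the parser and let it work out the encoding itself."""
    return etree.fromstring(data)


def unicode_is_rejected(text):
    """Return True if lxml refuses to parse the given unicode string."""
    try:
        etree.fromstring(text)
    except ValueError:
        return True
    return False


root = parse_bytes(XML_BYTES)
# Compare against the expected code points instead of printing raw text.
print(root.text == u"\uc560\ud50c")
# The decoded string still contains the declaration, so lxml rejects it.
print(unicode_is_rejected(XML_BYTES.decode("utf-8")))
```

So if you must pass unicode, strip the declaration first; otherwise just pass the bytes as they came off the wire.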
Stefan