[lxml-dev] non-ascii characters get garbled

Stefan Behnel stefan_ml at behnel.de
Mon Sep 17 14:41:25 CEST 2007


js wrote:
> The lxml doc [*1] says that
> "You should generally avoid converting XML/HTML data to unicode before
> passing it into the parsers. It is both slower and error prone."
> 
> [*1] http://codespeak.net/lxml/parsing.html#python-unicode-strings
> 
> but my experience is different from that.

Not quite. As you say below, you sometimes get ValueErrors depending on the
page data, so it *is* error prone.


> For example, the following code doesn't bother encoding things
> and leaves the work to lxml.etree.
> According to the doc, this is the right way, but it doesn't work
> and you'll get garbled characters. (give it a try)
> 
> --------------------------------------------------------------------
> # -*- coding: utf-8 -*-
> from lxml import html as etree

This import makes your code hard to read IMHO, since it hides lxml.html behind
the name "etree". If you use lxml.html, say so.


> url='http://apple.com/kr'
> 
> tree = etree.parse(url)
> from pprint import pformat
> for t in tree.xpath('//a[text()]'):
>     print t.text_content()
> --------------------------------------------------------------------

Hmm, when I do that, it prints beautiful Asian (Korean?) letters for me. And
when I collect the text, it looks perfectly reasonable, including strings like

  u'Copyright \xa9 2007 \uc560\ud50c\ucef4\ud4e8\ud130\ucf54\ub9ac\uc544. \ubaa8\ub4e0 \uad8c\ub9ac \ubcf4\uc720 \xa0 '

This is on lxml 2.0alpha2 and libxml2 2.6.27, on a Linux UTF-8 console.

Are you sure it's the text content and not just the console output on your side?
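
One way to tell the two cases apart is to look at the code points of the
string instead of relying on what the terminal renders. The following is only
a minimal sketch, written for Python 3 (where the old "unicode" type is plain
"str"). The sample string is a shortened piece of the output quoted above, and
the helper name can_encode is hypothetical.

```python
import sys

# Shortened sample of the text quoted above: a copyright sign plus two
# Korean syllables.
text = '\xa9 \uc560\ud50c'

# ascii() escapes every non-ASCII character, so what it shows does not
# depend on the terminal's encoding or font.
print(ascii(text))


def can_encode(encoding, s):
    """Return True if s can be written out in the given encoding."""
    try:
        s.encode(encoding)
        return True
    except (UnicodeEncodeError, LookupError):
        return False


# The encoding Python uses for console output (may be None when redirected).
print(sys.stdout.encoding)
print(can_encode('utf-8', text), can_encode('ascii', text))
```

If the escaped form shows the expected code points but a plain print looks
garbled, the parsed text is fine and the problem is on the console side.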


> Sometimes I got "ValueError: Unicode strings with encoding declaration
> are not supported. "

On the same page? I assume you were referring to a different page here that
probably uses XHTML instead of HTML, right? The above should work for both -
as long as libxml2 can detect the encoding (and if it can't, there's
lxml.html.ElementSoup to the rescue, if you have BeautifulSoup installed).
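
To illustrate the difference between byte input and pre-decoded input, here is
a rough sketch, again assuming Python 3 and a reasonably recent lxml. The
sample document is made up for illustration. It carries an XML declaration
that names its encoding, as an XHTML page typically would.

```python
from lxml import etree

# Hypothetical XHTML-style input: the XML declaration names the encoding.
raw = '<?xml version="1.0" encoding="utf-8"?>\n<p>\uc560\ud50c</p>'.encode('utf-8')

# Bytes in: the parser reads the declaration and decodes the data itself.
root = etree.fromstring(raw)
text_from_bytes = root.text

# Pre-decoded text in: the declaration no longer describes the data, so
# lxml is expected to reject it with the ValueError quoted above.
try:
    etree.fromstring(raw.decode('utf-8'))
    error = None
except ValueError as exc:
    error = str(exc)

print(ascii(text_from_bytes))
print(error)
```

Passing the undecoded bytes (or a URL or file name) avoids the error, because
the parser then handles the encoding declaration on its own.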

Stefan

