[lxml-dev] non-ascii characters get garbled
js
ebgssth at gmail.com
Mon Sep 17 16:09:02 CEST 2007
Thank you for your reply.
> > --------------------------------------------------------------------
> > # -*- coding: utf-8 -*-
> > from lxml import html as etree
>
> This import makes your code hard to read IMHO. If you use lxml.html, say it.
Oh, html is just a slightly different version of etree, so
I always use the import above. That was just my habit.
Next time I'll say so explicitly, thanks.
Explicit is better than implicit :)
> > url='http://apple.com/kr'
> >
> > tree = etree.parse(url)
> > from pprint import pformat
> > for t in tree.xpath('//a[text()]'):
> >     print t.text_content()
> > --------------------------------------------------------------------
>
> Hmm, when I do that, it prints beautiful Asian (Korean?) letters for me. And
> when I collect the text, it looks perfectly reasonable, including strings like
>
> u'Copyright \xa9 2007 \uc560\ud50c\ucef4\ud4e8\ud130\ucf54\ub9ac\uc544.
> \ubaa8\ub4e0 \uad8c\ub9ac \ubcf4\uc720 \xa0 '
>
> This is on lxml 2.0alpha2 and libxml2 2.6.27, on a Linux UTF-8 console.
>
> Are you sure it's the text content and not just the console output on your side?
This is on lxml 2.0alpha2 and libxml2 2.6.29_0.
I got the following.
$ ./lxml_test.py
Apple
Store
Mac
iPod + iTunes
Downloads
Support
ì¨ë¼ì¸ ì¤í ì´
ì í" ê³µì¸ í매 ë리ì
ì¬ì´í¸ ë§µ
ìµì ìì
문ì
ê³ ê° ì§ì
ìë¹ì ìì ì ìí ë¦¬ì½ í"ë¡ê·¸ë¨ - iBook G4 ë°
PowerBook G4 ë°°í°ë¦¬ êµì²´
í"ë(c)´ ë° ì ì 문ì ì ëí eMac ì리 ì°ì¥ í"ë¡ê·¸ë¨
here.
ì¬ì(c) ì½ê´
ê°ì¸ì ë³´ ë³´í¸ì ì±
This is not a console problem, because I get the correct result
when I use the latter method, as I said before.
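The garbled lines above appear consistent with UTF-8 bytes being decoded as Latin-1 somewhere along the way. Here is a minimal sketch of that effect, assuming Python 3 syntax and only the standard library. The sample text is the first two characters of the Copyright string quoted earlier, and the variable names are just for illustration.

```python
# Two Korean characters taken from the quoted Copyright string.
text = "\uc560\ud50c"

# Encode as UTF-8, then (wrongly) decode those bytes as Latin-1.
# Each Hangul syllable is three bytes in UTF-8, so 2 characters become 6.
garbled = text.encode("utf-8").decode("latin-1")

# The damage is reversible as long as no bytes were dropped in between.
repaired = garbled.encode("latin-1").decode("utf-8")

print(ascii(garbled))
print(repaired == text)
```

If the round trip restores the original, the bytes themselves were fine and only the decoding step picked the wrong charset.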
> > Sometimes I got "ValueError: Unicode strings with encoding declaration
> > are not supported. "
>
> On the same page? I assume you were referring to a different page here that
> probably uses XHTML instead of HTML, right? The above should work for both -
> as long as libxml2 can detect the encoding (and if it can't, there's
> lxml.html.ElementSoup to the rescue, if you have BeautifulSoup installed).
Yes, from different page.
I got the error when fetching http://www.hatena.com/
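As far as I understand it, the ValueError shows up when an already decoded unicode string that still carries an XML declaration with an encoding is handed to the parser. One possible workaround, sketched below with only the standard library and Python 3 syntax, is to drop the encoding from the declaration first. The helper name strip_encoding_declaration and the sample page are made up for illustration, and the UTF-8 declaration is an assumption, not what that site actually sends.

```python
import re

# Matches a leading XML declaration and captures everything that comes
# before its encoding="..." pseudo-attribute.
_XML_DECL_ENCODING = re.compile(
    r'^(\s*<\?xml[^>]*?)\s+encoding\s*=\s*(["\'])[^"\']*\2',
    re.IGNORECASE,
)


def strip_encoding_declaration(text):
    """Hypothetical helper: remove encoding="..." from a leading XML declaration."""
    return _XML_DECL_ENCODING.sub(r"\g<1>", text, count=1)


# Made-up sample document; the declared encoding is only an assumption.
page = '<?xml version="1.0" encoding="UTF-8"?>\n<html><body>\uc560\ud50c</body></html>'
cleaned = strip_encoding_declaration(page)
print(cleaned.splitlines()[0])
```

The cleaned string could then be passed to something like lxml.html.fromstring(). Handing the parser the raw, undecoded bytes instead should avoid the problem as well, since it can then read the declaration itself.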
Thanks.