[lxml-dev] non-ascii characters get garbled

js ebgssth at gmail.com
Mon Sep 17 16:09:02 CEST 2007


Thank you for your reply.

> > --------------------------------------------------------------------
> > # -*- coding: utf-8 -*-
> > from lxml import html as etree
>
> This import makes your code hard to read IMHO. If you use lxml.html, say it.

Oh, lxml.html is just a slightly different version of etree, so
I always use the import above. That is just my habit.
I'll say explicitly which module I'm using next time, thanks.
Explicit is better than implicit  :)

> > url='http://apple.com/kr'
> >
> > tree = etree.parse(url)
> > from pprint import pformat
> > for t in tree.xpath('//a[text()]'):
> >     print t.text_content()
> > --------------------------------------------------------------------
>
> Hmm, when I do that, it prints beautiful Asian (Korean?) letters for me. And
> when I collect the text, it looks perfectly reasonable, including strings like
>
>   u'Copyright \xa9 2007 \uc560\ud50c\ucef4\ud4e8\ud130\ucf54\ub9ac\uc544.
> \ubaa8\ub4e0 \uad8c\ub9ac \ubcf4\uc720 \xa0 '
>
> This is on lxml 2.0alpha2 and libxml2 2.6.27, on a Linux UTF-8 console.
>
> Are you sure it's the text content and not just the console output on your side?

This is on lxml 2.0alpha2 and libxml2 2.6.29_0.
I got the following output.

$ ./lxml_test.py
Apple
Store
Mac
iPod + iTunes
Downloads
Support
온라인 ìŠ¤í† ì–´
ì• í"Œ 공인 판매 ëŒ€ë¦¬ì 
사이트 맵
ìµœì‹  소식
문의
ê³ ê° 지원
소비자 ì•ˆì „ì„ 위한 리콜 í"„로그램 - iBook G4 및
PowerBook G4 배터리 교체
í™"ë(c)´ 및 ì „ì› ë¬¸ì œì— 대한 eMac 수리 연장 í"„로그램
here.
사ìš(c) 약관
ê°œì¸ì •ë³´ ë³´í˜¸ì •ì±


This is not a console problem, because I get the correct result
when I use the latter method, as I said before.
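
For illustration, here is a minimal sketch (Python 3 syntax, standard
library only) of how this kind of garbling can arise: UTF-8 bytes decoded
with a single-byte codec. The choice of cp1252 is an assumption based on
the shape of the output above, and the helper names garble() and repair()
are hypothetical. It does not show where in the parsing chain the wrong
decoding happens.

```python
# Sketch: reproduce the garbling by decoding UTF-8 bytes with the wrong codec.
# Assumption: the wrong codec is cp1252; nothing in this thread confirms that.

def garble(text, wrong_codec="cp1252"):
    """Encode text as UTF-8, then decode those bytes with a single-byte codec."""
    return text.encode("utf-8").decode(wrong_codec)


def repair(garbled, wrong_codec="cp1252"):
    """Undo garble(): recover the original bytes and decode them as UTF-8."""
    return garbled.encode(wrong_codec).decode("utf-8")


original = "\uc2a4\ud1a0\uc5b4"  # Korean word that appears garbled in the output
broken = garble(original)

# ascii() keeps the printout safe on consoles that cannot show these characters.
print(ascii(broken))
print(repair(broken) == original)
```

The round trip only works when every byte is defined in the wrong codec,
so repair() is a diagnostic aid and not a general fix.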

> > Sometimes I got "ValueError: Unicode strings with encoding declaration
> > are not supported. "
>
> On the same page? I assume you were referring to a different page here that
> probably uses XHTML instead of HTML, right? The above should work for both -
> as long as libxml2 can detect the encoding (and if it can't, there's
> lxml.html.ElementSoup to the rescue, if you have BeautifulSoup installed).

Yes, that was from a different page.
I got the error when fetching http://www.hatena.com/
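
As far as I understand it, the ValueError appears when an already decoded
unicode string that still carries an XML encoding declaration is handed to
the parser. Passing the undecoded bytes is one way around it. Another is to
drop the declaration from the decoded text first, which is sketched below
(Python 3 syntax, standard library only). The helper name
strip_xml_declaration() and the sample document are made up for
illustration, and the sketch does not call lxml itself.

```python
import re

# Matches an XML declaration at the very start of the text, for example
# <?xml version="1.0" encoding="..."?>, plus any whitespace around it.
_XML_DECL = re.compile(r'^\s*<\?xml\s[^>]*?\?>\s*')


def strip_xml_declaration(text):
    """Return text without a leading XML declaration (hypothetical helper)."""
    return _XML_DECL.sub('', text, count=1)


# Made-up sample document, not the real content of any page in this thread.
sample = '<?xml version="1.0" encoding="euc-jp"?>\n<html><body>x</body></html>'
cleaned = strip_xml_declaration(sample)
print(cleaned)
# The cleaned unicode string no longer declares an encoding, so it could
# then be passed to a parser that rejects such declarations.
```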

Thanks.

