[lxml-dev] Error (?) with UTF-8 document and Python unicode repr.

Frederik Elwert felwert at uni-bremen.de
Thu Nov 29 10:42:34 CET 2007


Hi!

On Thursday, 29.11.2007, at 00:19 +0100, Artur Siekielski wrote:
> But when using lxml something strange happens:
> 
>  >>> from lxml import etree
>  >>> t = etree.parse(open('test_doc.html'), etree.HTMLParser())
> 
> Now getting title element text:
> 
>  >>> t.getroot()[0][0].text
> u'\xc5\x81\xc4\x85ka'

Did you try it with the h1 element? Does it have the same problem?

I remember some discussions on the list about a similar problem. As far
as I remember, libxml2 can fail to decode the title properly because the
charset hint comes after the title has already been parsed.

But I don't currently know of any good workaround. Maybe somebody else
does, or you could have a look at the list archive.
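
For what it's worth, the repr quoted above looks like UTF-8 bytes that
were decoded one byte per character, as if they were Latin-1. Here is a
minimal sketch, using only the standard library, that checks this
reading and repairs the string after the fact. The variable names are
made up for illustration, and this only patches the symptom; it is not
a fix for the parser.

```python
# The text value from the quoted session.
garbled = u'\xc5\x81\xc4\x85ka'

# Every character is below U+0100, so encoding as Latin-1 maps each
# character back to the single byte it presumably came from.
raw_bytes = garbled.encode('latin-1')

# Decoding those bytes as UTF-8 gives the text the document most
# likely contained (assumption: the file really is UTF-8).
repaired = raw_bytes.decode('utf-8')

# U+0141 and U+0105 are the Polish letters L-with-stroke and a-with-ogonek.
print(repaired == u'\u0141\u0105ka')
```

If the h1 text comes back correctly while the title does not, that would
fit the explanation that only the part parsed before the charset hint is
affected.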

Cheers,
Frederik


