[lxml-dev] Error (?) with UTF-8 document and Python unicode repr.

Artur Siekielski artur.siekielski at gmail.com
Thu Nov 29 18:21:57 CET 2007


Frederik Elwert wrote:
> Hi!
> 
> On Thursday, 29.11.2007, at 00:19 +0100, Artur Siekielski wrote:
>> But when using lxml something strange happens:
>>
>>  >>> from lxml import etree
>>  >>> t = etree.parse(open('test_doc.html'), etree.HTMLParser())
>>
>> Now getting title element text:
>>
>>  >>> t.getroot()[0][0].text
>> u'\xc5\x81\xc4\x85ka'
> Did you try it with the h1 element? Does it have the same problem?

Yes, the h1 element shows the same error. But I noticed that when I move 
the meta tag with the charset declaration before <title>, everything is 
parsed correctly, including the h1 element. So is this a libxml2 
bug/limitation? (I tried the latest libxml2 from trunk and it behaves the 
same.)
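
For reference, the repr quoted above is consistent with the UTF-8 bytes of 
the title being read one byte per character, as Latin-1 would do. The 
sketch below is Python 3 and uses no lxml; the word "Łąka" is inferred from 
the repr and is an assumption about the test document.

```python
# Python 3 sketch. "Łąka" is inferred from the quoted repr, not taken
# from the actual test document.
expected = "Łąka"

# The UTF-8 encoding of the title: two bytes each for "Ł" and "ą".
utf8_bytes = expected.encode("utf-8")

# Reading those bytes as Latin-1 maps every byte to one code point,
# which reproduces the string seen in the session above.
misdecoded = utf8_bytes.decode("latin-1")

# Such a string can be repaired by reversing the wrong step.
repaired = misdecoded.encode("latin-1").decode("utf-8")

print(repr(utf8_bytes))
print(repr(misdecoded))
print(repaired)
```

The round trip in the last step only works when every character of the 
damaged string fits in Latin-1, which holds here.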

I'm parsing third-party HTML, so I must find a workaround. Is this a good 
solution: parse the HTML, reorder the elements in <head>, serialize the 
document, and parse it again?
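
One possible alternative to parsing twice, sketched here under assumptions, 
is to look for the charset declaration in the raw bytes first and decide 
the encoding before the parser sees the document. The helper names 
(sniff_meta_charset, decode_html) and the sample document are hypothetical 
and not part of lxml or libxml2; the regular expression is a rough 
heuristic, not a full implementation of HTML encoding detection.

```python
import codecs
import re

# Hypothetical helper pattern: finds "charset=..." inside a <meta> tag.
# Heuristic only; it does not handle comments, scripts, or BOMs.
_META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(raw, default="iso-8859-1"):
    """Return the charset named in a <meta> tag, or `default` if none is usable."""
    match = _META_CHARSET.search(raw)
    if match is None:
        return default
    name = match.group(1).decode("ascii")
    try:
        codecs.lookup(name)  # reject names Python has no codec for
    except LookupError:
        return default
    return name


def decode_html(raw):
    """Decode raw HTML bytes using the sniffed charset."""
    return raw.decode(sniff_meta_charset(raw), errors="replace")


# Made-up sample with the meta tag placed after <title>, as in the report.
sample = (
    "<html><head><title>Łąka</title>"
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    "</head><body><h1>Łąka</h1></body></html>"
).encode("utf-8")

print(sniff_meta_charset(sample))
print("Łąka" in decode_html(sample))
```

In lxml releases whose HTMLParser accepts an encoding keyword, the detected 
name could presumably be handed over as etree.HTMLParser(encoding=name), 
which would avoid the second parse; whether that is available depends on 
the installed version.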

Regards,
Artur

