[lxml-dev] Error (?) with UTF-8 document and Python unicode repr.

Artur Siekielski artur.siekielski at gmail.com
Thu Nov 29 00:19:59 CET 2007


Hi.

First of all, thanks for a great XML/HTML library! The API is really well 
thought out.

I'm coming here with a problem with an HTML document encoded in UTF-8:

$ cat test_doc.html
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
       <title>Łąka</title>
       <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
       <h1>Gdańsk</h1>
</body>
</html>


(The "title" and "h1" contents are UTF-8 strings whose characters can all 
be encoded in latin2.)
From raw Python everything seems to be as expected:

 >>> sdata = open('test_doc.html').read()
 >>> sdata[219:240]
'<title>\xc5\x81\xc4\x85ka</title>'
 >>> udata = unicode(sdata, 'utf-8')
 >>> udata[219:240]
u'<title>\u0141\u0105ka</title>\n '
 >>> print udata[219:240].encode('latin2')
<title>Łąka</title>

The last statement prints as expected on my console with latin2 charset. 
But when using lxml something strange happens:

 >>> from lxml import etree
 >>> t = etree.parse(open('test_doc.html'), etree.HTMLParser())

Now getting title element text:

 >>> t.getroot()[0][0].text
u'\xc5\x81\xc4\x85ka'

This is strange, because this is a unicode string (as indicated by the 
leading "u") but its representation printed to the console is the same as 
the raw bytes from the 'sdata' var. I would expect it to be equal to the 
corresponding contents of the 'udata' var. As a consequence, converting 
to latin2 doesn't work:

 >>> t.getroot()[0][0].text.encode('latin2')
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/usr/share/python2.5/encodings/iso8859_2.py", line 12, in encode
     return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xc5' in 
position 0: character maps to <undefined>
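
For reference, the value returned above is exactly what you get when the 
UTF-8 bytes of the title are decoded as Latin-1, one byte per character. 
The snippet below is only a sketch of that assumption using plain Python 
strings. It does not call lxml, and the names raw, observed and repaired 
are made up for illustration. Re-encoding as Latin-1 and decoding as UTF-8 
undoes the damage for this string, but it is a stopgap that only makes 
sense if the assumption holds:

```python
# Bytes of the <title> text as stored in the UTF-8 file.
raw = b'\xc5\x81\xc4\x85ka'

# What a correct UTF-8 decode yields (matches the udata slice).
expected = raw.decode('utf-8')

# Assumption: each byte was treated as one Latin-1 character.
observed = raw.decode('latin-1')

# U+00C5 has no slot in ISO 8859-2, so encoding the garbled text fails.
try:
    observed.encode('iso8859_2')
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# Stopgap: turn the characters back into bytes, then decode properly.
repaired = observed.encode('latin-1').decode('utf-8')
latin2_bytes = repaired.encode('iso8859_2')
```

The b'' literal needs Python 2.6 or newer; on 2.5 a plain string literal 
does the same job.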

If it's not an error, please tell me. For now I cannot even find any 
reasonable workaround.
I'm using the latest lxml 1.3.6.

Thanks for looking at this problem,
Regards,
Artur

