[lxml-dev] Error (?) with UTF-8 document and Python unicode repr.
Artur Siekielski
artur.siekielski at gmail.com
Thu Nov 29 00:19:59 CET 2007
Hi.
First of all, thanks for a great XML/HTML library! The API is really well
thought out.
I have a problem with an HTML document encoded in UTF-8:
$ cat test_doc.html
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Łąka</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<h1>Gdańsk</h1>
</body>
</html>
("title" and "h1" contents are UTF-8 strings that can be encoded to latin2).
From raw Python everything seems to be as expected:
>>> sdata = open('test_doc.html').read()
>>> sdata[219:240]
'<title>\xc5\x81\xc4\x85ka</title>'
>>> udata = unicode(sdata, 'utf-8')
>>> udata[219:240]
u'<title>\u0141\u0105ka</title>\n '
>>> print udata[219:240].encode('latin2')
<title>Łąka</title>
The last statement prints as expected on my console with latin2 charset.
But when using lxml something strange happens:
>>> from lxml import etree
>>> t = etree.parse(open('test_doc.html'), etree.HTMLParser())
Now getting title element text:
>>> t.getroot()[0][0].text
u'\xc5\x81\xc4\x85ka'
This is strange, because this is a unicode string (as indicated by the
leading "u"), but its representation printed to the console is the same as the
raw bytes from the 'sdata' variable. I would expect it to be equal to the
contents of the 'udata' variable. As a consequence, converting to latin2 doesn't work:
>>> t.getroot()[0][0].text.encode('latin2')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/share/python2.5/encodings/iso8859_2.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xc5' in
position 0: character maps to <undefined>
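One possible explanation, offered as an assumption and not a confirmed diagnosis, is that the parser treated the UTF-8 bytes as Latin-1, so each byte became its own code point. The sketch below reproduces that effect using only the standard library (written for Python 3, no lxml involved) and shows a stopgap that reverses it. The helper name repair_mojibake is hypothetical, and the trick only works while every character in the damaged string is below U+0100.

```python
# Raw UTF-8 bytes of the <title> text, copied from the 'sdata' slice above.
raw = b"\xc5\x81\xc4\x85ka"

# What the author expects: two Polish letters followed by "ka".
expected = raw.decode("utf-8")

# Assumed failure mode: the same bytes decoded as Latin-1, one code point per byte.
observed = raw.decode("latin-1")


def repair_mojibake(text):
    """Undo a Latin-1 misdecoding of UTF-8 data (hypothetical helper).

    Raises UnicodeEncodeError if the text holds characters above U+00FF,
    and UnicodeDecodeError if the recovered bytes are not valid UTF-8.
    """
    return text.encode("latin-1").decode("utf-8")


repaired = repair_mojibake(observed)

# The repaired text can be encoded to latin2, unlike the damaged one.
latin2_bytes = repaired.encode("iso8859_2")
```

Here observed matches the string returned by lxml in the session above, and repaired equals the corresponding part of 'udata'. This only patches the symptom; getting the parser to use the right input encoding would be the cleaner fix.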
If it's not an error, please tell me. For now I cannot even find any
reasonable workaround.
I'm using the latest lxml 1.3.6.
Thanks for looking at this problem,
Regards,
Artur