[lxml-dev] Unicode oddness
Adam
dood at zworg.com
Wed Apr 8 11:50:04 CEST 2009
The following seems wrong to me:
I have a UTF-8 encoded string of HTML containing the word 'Français':
>>> html = '<html><head><title>Fran\xc3\xa7ais</title></head></html>'
I feed it to lxml.html:
>>> import lxml.html
>>> root = lxml.html.fromstring(html)
When I get the text back from lxml, it is a unicode string, but the UTF-8
bytes in it have not been decoded:
>>> root.text_content()
u'Fran\xc3\xa7ais'
The expected output would be decoded unicode, i.e. the result of:
>>> 'Fran\xc3\xa7ais'.decode('utf-8')
u'Fran\xe7ais'
Alternatively, I could get back the original UTF-8 encoded byte string,
'Fran\xc3\xa7ais'.
Either of these results would make sense and work for me, but the actual
result is an odd mix of the two: a unicode string that still holds the raw
UTF-8 bytes. Is this an lxml problem, or have I misunderstood something?
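For reference, the odd value is exactly what you get if the UTF-8 bytes are
read as Latin-1 (ISO-8859-1), one code point per byte. Whether that is what
happens inside the parser is an assumption and is not confirmed here. A
minimal sketch using only the standard library (written for Python 3, unlike
the Python 2 session above; the variable names are illustrative):

```python
# Python 3 sketch; the session above is Python 2.
raw = b'Fran\xc3\xa7ais'  # UTF-8 bytes of the title text

as_utf8 = raw.decode('utf-8')      # the expected text
as_latin1 = raw.decode('latin-1')  # every byte becomes one code point

print(ascii(as_utf8))    # 'Fran\xe7ais'
print(ascii(as_latin1))  # 'Fran\xc3\xa7ais', the value seen above

# Undoing a Latin-1 mis-decoding recovers the expected text.
repaired = as_latin1.encode('latin-1').decode('utf-8')
print(repaired == as_utf8)  # True
```

If that assumption holds, decoding the byte string to unicode before passing
it to lxml.html.fromstring, or declaring the charset in the document itself,
would probably avoid the problem.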
Thanks, Adam