[lxml-dev] Unicode oddness

Adam dood at zworg.com
Wed Apr 8 11:50:04 CEST 2009


The following seems wrong to me:

I have a UTF-8 encoded byte string of HTML containing the word 'Français':
>>> html = '<html><head><title>Fran\xc3\xa7ais</title></head></html>'

I feed it to lxml.html:
>>> import lxml.html
>>> root = lxml.html.fromstring(html)

When I get the text back from lxml, it is a unicode string, but the UTF-8 
bytes in it have not been decoded:
>>> root.text_content()
u'Fran\xc3\xa7ais'

The expected output would be decoded unicode, i.e. the result of:
>>> 'Fran\xc3\xa7ais'.decode('utf-8')
u'Fran\xe7ais'

Alternatively, I could just get back the encoded UTF-8 string 'Fran\xc3\xa7ais'.

Either of these results would make sense and work for me. But the actual result 
is an odd mixture of the two: a unicode string whose characters are the raw 
UTF-8 bytes. Is this an lxml problem, or have I misunderstood something?
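
To make the difference concrete, here is a minimal sketch that does not 
involve lxml at all. It only shows that the value I get back is identical to 
what decoding the UTF-8 bytes as Latin-1 would produce; whether that is what 
actually happens inside lxml is an assumption on my part, and the variable 
names are made up for the illustration.

```python
# Illustration only, no lxml involved; all names here are hypothetical.
raw = b'Fran\xc3\xa7ais'            # the UTF-8 encoded bytes from the <title>

expected = raw.decode('utf-8')      # what I expected: u'Fran\xe7ais'
observed = raw.decode('latin-1')    # equals what I got back: u'Fran\xc3\xa7ais'

# Latin-1 maps each byte to the code point with the same number, so
# re-encoding as Latin-1 restores the original bytes, which can then be
# decoded properly as UTF-8.
repaired = observed.encode('latin-1').decode('utf-8')

print(repaired == expected)
```

If that assumption holds, the last two lines would also serve as a workaround 
on the returned text, though I would rather understand the cause.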

Thanks, Adam


