[lxml-dev] Unicode oddness

Stefan Behnel stefan_ml at behnel.de
Wed Apr 8 12:12:30 CEST 2009


Adam wrote:
> I have a utf-8 encoded string with html containing the word 'Français':
> >>> html = '<html><head><title>Fran\xc3\xa7ais</title></head></html>'
>
> I feed it to lxml.html:
> >>> root = lxml.html.fromstring(html)
>
> When I get the text from lxml, it is a unicode string, but it has not been
> decoded!:
> >>> root.text_content()
> u'Fran\xc3\xa7ais'

Your HTML snippet lacks a <meta> tag declaring its charset, so the
HTMLParser has no way of knowing what encoding the snippet uses. It
therefore falls back to assuming Latin-1. If your snippet were encoded in
Latin-1, you'd be quite happy about this default.
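
The effect can be reproduced without lxml at all. Here is a minimal sketch
in Python 3 syntax (the session quoted above is Python 2); the variable
names are only for illustration:

```python
# The UTF-8 encoding of 'Français': the c-cedilla takes two bytes, 0xC3 0xA7.
raw = b'Fran\xc3\xa7ais'

# Decoding as Latin-1 maps every byte to exactly one character, so the
# two-byte sequence turns into two separate characters.
misread = raw.decode('latin-1')

# Decoding as UTF-8 combines the two bytes into the single character U+00E7.
correct = raw.decode('utf-8')

print(repr(misread))
print(repr(correct))
print(len(misread), len(correct))
```

The misread string is one character longer than the correct one, which
matches the u'Fran\xc3\xa7ais' result shown in the question.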

If you know the encoding in advance, you can create your own parser
instance and pass it the "encoding" keyword option. There are tools that
can try to detect an encoding from a string that you pass in, e.g.
chardet. It is, however, impossible for any tool in the world to always
recover the missing encoding information for all possible data.
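
As a rough sketch of the first option, assuming you know the input is
UTF-8 and passing it as a byte string (Python 3 syntax; the names
utf8_parser and title_text are just for illustration):

```python
import lxml.html

# Byte string with UTF-8 content and no <meta> charset declaration.
html = b'<html><head><title>Fran\xc3\xa7ais</title></head></html>'

# Tell the parser the encoding explicitly instead of letting it guess.
utf8_parser = lxml.html.HTMLParser(encoding='utf-8')
root = lxml.html.fromstring(html, parser=utf8_parser)

# With the encoding supplied, the title should come back properly decoded.
title_text = root.text_content()
print(repr(title_text))
```

If the encoding is not known in advance, a guess from a detection tool
could be passed in the same way, with the caveat that the guess may be
wrong.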

Stefan


