[lxml-dev] automatic attribute unicode decode?

Hervé Cauwelier herve.cauwelier at free.fr
Fri Jul 31 17:39:45 CEST 2009


Hi,

I'm quite puzzled by the following excerpt:

>>> from lxml import etree
>>> r = etree.fromstring('<root toto="français" titi="ascii" tata="1"/>')
>>> r.attrib
{'titi': 'ascii', 'toto': u'fran\xe7ais', 'tata': '1'}

In a bare document with no encoding declaration, lxml has decoded, on
its own, a string that does not fit in the ASCII table (what heuristic
did it use?). Now I have three attributes of two different types: one
unicode string and two plain byte strings. I wonder why the integer was
not decoded. ;-)

I actually ran into this in a real-world document with an encoding
declaration and namespaces (an ODF XML part).

Is this a bug I should report, and how can I work around it?
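
The only workaround I can think of is to normalize the values myself.
A minimal sketch, assuming lxml only ever returns plain byte strings
when the value is pure ASCII (my guess from the output above, not
something I have verified); to_text and text_attributes are
hypothetical helper names of mine, not part of lxml:

```python
def to_text(value):
    """Return value as a text (unicode) string.

    Assumption: a byte string handed back by lxml is pure ASCII,
    so decoding it as ASCII cannot fail or change its content.
    """
    if isinstance(value, bytes):
        return value.decode('ascii')
    # Already a text string: leave it untouched.
    return value


def text_attributes(attrib):
    """Copy a mapping of attributes so that every name and value is text."""
    return dict((to_text(name), to_text(value))
                for name, value in attrib.items())
```

Calling text_attributes(r.attrib) would then give me a plain dict with
a single string type, but it is a copy, so changes to it would not be
written back to the element.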

Thanks,

Hervé

