[lxml-dev] automatic attribute unicode decode?

John Lovell jlovell at nwesd.org
Fri Jul 31 18:42:01 CEST 2009


Hervé:

I keep hearing that lxml defaults to UTF-8, so that is probably the heuristic used.
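
For what it's worth, the XML specification says that a document with no encoding declaration (and no byte order mark) is read as UTF-8, so no guessing should be involved. On Python 2, lxml returns plain byte strings for ASCII-only values and unicode for everything else, which would explain the mixed types. Below is a minimal sketch of one way to get uniform text values. Note that to_text is a hypothetical helper, not part of lxml, and the fallback import of the standard library's ElementTree is only an assumption so that the snippet also runs where lxml is not installed:

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: the standard library's ElementTree behaves closely
    # enough for this small demonstration.
    import xml.etree.ElementTree as etree

# Encode explicitly so the parser receives UTF-8 bytes, which is what XML
# assumes when the document carries no encoding declaration.
xml_bytes = u'<root toto="fran\xe7ais" titi="ascii" tata="1"/>'.encode("utf-8")
root = etree.fromstring(xml_bytes)


def to_text(value, encoding="ascii"):
    """Hypothetical helper: return `value` as a text (unicode) string.

    Byte strings are decoded; text strings are passed through unchanged.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Normalize every attribute name and value to the same text type.
attrs = dict((to_text(k), to_text(v)) for k, v in root.attrib.items())
```

The helper only changes the type of the ASCII-only values; the parsed content stays the same.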

Good luck,

John W. Lovell
Web Applications Engineer
Northwest Educational Service District
1601 R Avenue
Anacortes, WA 98221
(360) 299-4086
jlovell at nwesd.org
 
www.nwesd.org
Together We Can ...
 

-----Original Message-----
From: lxml-dev-bounces at codespeak.net [mailto:lxml-dev-bounces at codespeak.net] On Behalf Of Hervé Cauwelier
Sent: Friday, July 31, 2009 8:40 AM
To: lxml-dev at codespeak.net
Subject: [lxml-dev] automatic attribute unicode decode?

Hi,

I'm quite puzzled by the following excerpt:

>>> from lxml import etree
>>> r = etree.fromstring('<root toto="français" titi="ascii" tata="1"/>')
>>> r.attrib
{'titi': 'ascii', 'toto': u'fran\xe7ais', 'tata': '1'}

In a bare document with no encoding declaration, lxml has decoded, on its own, a string that contains non-ASCII characters (what heuristic did it use?). Now I have three attributes of two different types. I wonder why the integer was not decoded. ;-)

I actually found this in a real-world document with an encoding declaration and namespaces (an ODF XML part).

Is this a bug I should report, and how can I work around it?

Thanks,

Hervé

