[lxml-dev] automatic attribute unicode decode?
John Lovell
jlovell at nwesd.org
Fri Jul 31 18:42:01 CEST 2009
Hervé:
I keep hearing that lxml defaults to UTF-8, so that is probably the heuristic used.
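For the "how to circumvent it" part, here is a rough sketch rather than an official recipe. It assumes the mixed types come from lxml on Python 2 handing back plain byte strings for ASCII-only values and unicode for everything else, and that input with no encoding declaration is read as UTF-8. The helper name text_attrib is made up for this example, and the fallback import is only there so the snippet runs without lxml installed.

```python
try:
    from lxml import etree                      # as in the original session
except ImportError:                             # fallback so the sketch still runs
    import xml.etree.ElementTree as etree


def text_attrib(element, encoding="utf-8"):
    """Return the element's attributes as a dict whose values are all text.

    Hypothetical helper, not part of lxml: byte-string values are decoded
    with `encoding`; values that are already text pass through unchanged.
    """
    result = {}
    for name, value in element.attrib.items():
        if isinstance(value, bytes):            # plain str on Python 2
            value = value.decode(encoding)
        result[name] = value
    return result


# Same document as in the question, passed as UTF-8 bytes with no encoding
# declaration (\xe7 is the c-cedilla in "francais").
xml = u'<root toto="fran\xe7ais" titi="ascii" tata="1"/>'.encode("utf-8")
root = etree.fromstring(xml)
attrs = text_attrib(root)
```

After this, every value in attrs has the same text type, so code that compares or concatenates attribute values no longer has to care which ones happened to be pure ASCII.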
Good luck,
John W. Lovell
Web Applications Engineer
Northwest Educational Service District
1601 R Avenue
Anacortes, WA 98221
(360) 299-4086
jlovell at nwesd.org
www.nwesd.org
Together We Can ...
-----Original Message-----
From: lxml-dev-bounces at codespeak.net [mailto:lxml-dev-bounces at codespeak.net] On Behalf Of Hervé Cauwelier
Sent: Friday, July 31, 2009 8:40 AM
To: lxml-dev at codespeak.net
Subject: [lxml-dev] automatic attribute unicode decode?
Hi,
I'm quite puzzled by the following excerpt:
>>> from lxml import etree
>>> r = etree.fromstring('<root toto="français" titi="ascii" tata="1"/>')
>>> r.attrib
{'titi': 'ascii', 'toto': u'fran\xe7ais', 'tata': '1'}
In a bare document with no encoding declaration, lxml has decoded, on its own, an attribute value that contains non-ASCII characters (what heuristic did it use?). Now I have three attributes of two different types. I wonder why the integer was not decoded. ;-)
I actually found this in a real-world document with an encoding declaration and namespaces (an ODF XML part).
Is this a bug I should report, and how can I work around it?
Thanks,
Hervé
_______________________________________________
lxml-dev mailing list
lxml-dev at codespeak.net
http://codespeak.net/mailman/listinfo/lxml-dev