[lxml-dev] extracting .text strings systematically in unicode

Stefan Behnel stefan_ml at behnel.de
Tue Dec 9 19:00:09 CET 2008


Hi,

Jean Daniel wrote:
> Is there a switch in the lxml module to make the strings of the xml
> document appear predictably as unicode even if the string can be
> represented as a simple 'str'?

No, that's the way ElementTree works (and lxml is ET compatible). This is
mainly for performance reasons, since ASCII strings are extremely common in
XML. Creating a plain ASCII str is more memory efficient and a lot faster
than creating a unicode object, and in Py2 it behaves the same in almost
all situations (except in APIs that specifically test for unicode objects
as input).
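
To see this for yourself, something like the following sketch should do. It
uses the standard library's xml.etree.ElementTree as a stand-in (my
assumption, so that it runs without lxml installed; since lxml follows the
ET behaviour, importing lxml.etree instead should give the same picture).
The element names and sample text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: <a> holds ASCII-only text,
# <b> holds text with one non-ASCII character (e-acute).
root = ET.fromstring(b"<root><a>plain</a><b>caf&#233;</b></root>")

for child in root:
    # Under Python 2 this should report 'str' for <a> and 'unicode' for <b>.
    # Under Python 3 both are 'str', which is already a unicode string.
    print("%s: %s" % (child.tag, type(child.text).__name__))
```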

You can either switch to Py3.0, where lxml always returns unicode strings,
or you can convert the string yourself. BTW, it's faster to do

	u""+s

than to do

	unicode(s)

although it might be considered less readable. The u""+s form also has the
advantage of raising an exception for non-strings, where unicode(s) would
silently convert them.
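
If you want this in one place, a small helper along these lines might work.
This is only a sketch: text_of and text_type are hypothetical names of mine,
not part of lxml or ElementTree, and the stdlib ElementTree is used just to
have something to parse with.

```python
import sys
import xml.etree.ElementTree as ET

# The unicode string type: 'unicode' on Python 2, 'str' on Python 3.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def text_of(element):
    """Return element.text as a unicode string, or None if it has no text."""
    s = element.text
    if s is None:
        return None
    # Same idea as u""+s: concatenating with an empty unicode string
    # raises TypeError for non-strings instead of converting them.
    return text_type() + s


# Hypothetical sample document for illustration.
root = ET.fromstring("<root><a>plain</a><b/></root>")
print(repr(text_of(root[0])))
print(repr(text_of(root[1])))
```

Elements without text give None in ElementTree, so the helper passes that
through rather than failing on the concatenation.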

Stefan


