[lxml-dev] extracting .text strings systematically in unicode

Jean Daniel jeandaniel.browne at gmail.com
Tue Dec 9 13:13:20 CET 2008


Hello,

I am working on a small XML-to-SQL application. Input attribute values
and text fields are usually unicode, but not always. They are fed into
the attributes of an object which only accepts unicode input and raises
an exception if the data is a 'str' instead (the object is a
Storm-persisted class).

My problem seems to be that lxml returns element text either as a
'str' or as a 'unicode', depending on the content of the text, as
shown in these snippets:

>>> from lxml.etree import XML

>>> type(XML('<tag>element</tag>').text)
<type 'str'>

>>> type(XML('<tag>élément</tag>').text)
<type 'unicode'>

So far, it seems that my only choice is to 'cast' every value extracted
from the XML document to unicode, which is cumbersome and does not seem
like it should be necessary. Example:

self.name = unicode(element.get('name'))
for child in element:
    setattr(self, child.tag, unicode(child.text))
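
A small helper would at least keep the conversion in one place. The
following is only a sketch: to_unicode, copy_fields and Record are
hypothetical names of mine, and the standard library's ElementTree
stands in for lxml here on the assumption that both expose the same
.get() and .text accessors. It also leaves None untouched, whereas
unicode(None) gives u'None' for a missing attribute or an empty element.

```python
import xml.etree.ElementTree as ET  # stand-in parser, assumed API-compatible with lxml.etree

text_type = type(u'')  # 'unicode' on Python 2, 'str' on Python 3


def to_unicode(value, encoding='ascii'):
    """Return value as a unicode string, passing None through untouched."""
    if value is None or isinstance(value, text_type):
        return value
    # A byte string: decode it explicitly instead of relying on unicode().
    return value.decode(encoding)


def copy_fields(target, element):
    """Copy the 'name' attribute and each child's text onto target as unicode."""
    target.name = to_unicode(element.get('name'))
    for child in element:
        setattr(target, child.tag, to_unicode(child.text))


class Record(object):
    """Hypothetical stand-in for the persisted class."""
    pass


record = Record()
copy_fields(record, ET.fromstring('<item name="n1"><title>element</title><note/></item>'))
```

The default encoding is 'ascii' on the assumption that a plain 'str'
only comes back when the text is pure ASCII, as in the first snippet
above.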


Is there a switch in the lxml module to make the strings of the XML
document appear predictably as unicode, even if the string can be
represented as a simple 'str'?

Thank you,

