[lxml-dev] special string subclasses for XPath string results

Stefan Behnel stefan_ml at behnel.de
Tue Jan 8 09:26:30 CET 2008


Hi Ian,

Ian Bicking wrote:
> Stefan Behnel wrote:
>> Maybe we should still keep up the str/unicode duality here. Although that
>> will be history with Python 3, it isn't now, and it is an integral part of
>> the current lxml API.
> 
> ElementUnicodeText and ElementStrText?  Not very pretty :-P

Well, fine, but that's how it is. Users will not have to deal with the classes
anyway, they will just be used in the background. isinstance(result, unicode)
will work just like before, as will isinstance(result, basestring).
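
A minimal sketch of why such a subclass stays invisible to existing checks.
The class name ElementTextResult and its constructor are hypothetical, not
lxml's API, and the text_type/string_base aliases are only there so that the
same snippet runs where unicode and basestring no longer exist:

```python
try:
    text_type = unicode          # Python 2
    string_base = basestring
except NameError:
    text_type = str              # Python 3: str is the Unicode type
    string_base = str


class ElementTextResult(text_type):
    """Hypothetical string result that remembers where it came from."""

    def __new__(cls, value, parent=None):
        self = super(ElementTextResult, cls).__new__(cls, value)
        self._parent = parent
        return self

    def getparent(self):
        return self._parent


# The parent is a placeholder string here; in lxml it would be an element.
result = ElementTextResult(u"some text", parent="<parent element stand-in>")

# Existing type checks keep working unchanged.
assert isinstance(result, text_type)
assert isinstance(result, string_base)
# It still compares and hashes like an ordinary string.
assert result == u"some text"
assert hash(result) == hash(u"some text")
```

Code that never asks for getparent() cannot tell the difference, which is the
point of keeping the classes in the background.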

The difference is just what happens when you call str() on them, or when you
pass them into Python's API, or... Some APIs don't care about the
str/unicode distinction, others do. We shouldn't deliberately break those that
do; that would just slow down *everything*.


> Maybe having XPath return values makes it important to be fast.

Not necessarily fast, but it shouldn't slow things down unnecessarily for
stuff that most people won't use. I imagine that the tricky part is the case
where it actually is a (non-ASCII) Unicode value. How do you instantiate a
custom unicode subclass from a UTF-8 char*? You can't use the normal C-API
functions, so I guess you'd have to instantiate a normal unicode object, then
determine its length, and then build the custom subclass for the result length
and copy the string over. That's ugly and it would certainly slow things down.
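
Seen from Python, the two-step construction looks roughly like the sketch
below. It is only an analogy for the C-level problem: TextResult is a
hypothetical subclass, and the byte string stands in for the UTF-8 char* that
libxml2 would hand over.

```python
try:
    text_type = unicode   # Python 2
except NameError:
    text_type = str       # Python 3: str is the Unicode type


class TextResult(text_type):
    """Hypothetical unicode subclass, for illustration only."""


# Stand-in for the UTF-8 encoded char* coming from libxml2.
utf8_data = u"gr\xfc\xdfe".encode("utf-8")

# Step 1: decode into a normal unicode object.
plain = utf8_data.decode("utf-8")

# Step 2: build the subclass instance from it. In CPython this allocates
# a second object and copies the characters over.
result = TextResult(plain)

assert type(plain) is text_type
assert type(result) is TextResult
assert result == plain and result is not plain
```

The intermediate plain object is thrown away immediately, which is the extra
work the paragraph above is worried about.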


> It doesn't seem that important that XPath return values be
> particularly light.  And I've found it problematic sometimes that
> non-node XPath return values are just strings (though that's been more
> an issue of attributes, where I'd like to know what attribute or element
> the value belonged to).  OTOH, something like XPath's string() really is
> a string without any place.  So it's all just kind of eclectic and awkward.

There's only so much we can do anyway. If the libxml2 result is a string,
there is no way we can figure out where it came from. So the result of
string() will always be a normal string instance. The only place where we could
change something is the case where the expression selects a text node or
an attribute (text() and @...). I don't even think we can support ranges here.
They would normally result from the substring() function, right? I doubt that
would return anything but a plain string value.

So to handle the result properly, you could do

  if isinstance(result, basestring):
      if hasattr(result, 'getparent'):
          print result.getparent().tag
          print result.is_text
          print result.is_tail
          print result.is_attribute
      else:
          print result

BTW, I would also add "is_text" in that case. It would seem weird to check
"is_attribute" and "is_tail" before you can determine that it actually is the
common case of a normal ".text" value. (Although "is_text" sounds a bit more
general than what it means here...)

I actually think we are talking about lxml 2.1 stuff here. I'll try to get 2.0
out of the door and then we can see how we could implement these things.

Stefan



