[lxml-dev] special string subclasses for XPath string results
Ian Bicking
ianb at colorstudy.com
Wed Jan 9 19:19:02 CET 2008
Stefan Behnel wrote:
>> It doesn't seem that important that XPath return values be
>> particularly light. And I've found it problematic sometimes that
>> non-node XPath return values are just strings (though that's been more
>> an issue of attributes, where I'd like to know what attribute or element
>> the value belonged to). OTOH, something like XPath's string() really is
>> a string without any place. So it's all just kind of eclectic and awkward.
>
> There's only so much we can do anyway. If the libxml2 result is a string,
> there is no way we can figure out where it came from. So the result of
> string() will always be a normal string instance. The only way where we could
> change something would be the case where the expression selects a text node or
> an attribute (text() and @...). I don't even think we can support ranges here.
> They would normally result from the substring() function, right? I doubt that
> would return anything but a plain string value.
That's kind of why I think performance doesn't matter: it won't
even come into play most of the time. In relation to XPath, the one
thing I would like is some representation of attributes. There's a
backward-compatibility issue, but the underlying engine returns attributes
as something different from normal text anyway, right? I think
attributes are mostly a different use case than text ranges.
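To make the attribute idea concrete, here is a rough sketch of a string subclass that remembers which element and attribute it came from. It is only an illustration: AttributeString and attribute_values are hypothetical names, not lxml API, and it uses Python 3 with the standard library's ElementTree in place of a real XPath evaluation.

```python
import xml.etree.ElementTree as ET


class AttributeString(str):
    """A str that also records the element and attribute it was read from.

    Hypothetical helper for illustration; not an lxml class.
    """

    def __new__(cls, value, parent, attrname):
        self = super().__new__(cls, value)
        self.parent = parent
        self.attrname = attrname
        return self

    def getparent(self):
        return self.parent


def attribute_values(root, attrname):
    """Collect every value of `attrname` under `root`, keeping its origin."""
    return [
        AttributeString(el.get(attrname), el, attrname)
        for el in root.iter()
        if attrname in el.attrib
    ]


root = ET.fromstring('<p><a href="/one">1</a><a href="/two">2</a></p>')
hrefs = attribute_values(root, "href")
first = hrefs[0]
# Behaves like a plain string, but still knows where it came from.
print(first.upper(), first.getparent().tag, first.attrname)
```

Because the result is still a str, existing code that expects plain strings should keep working, which is one way to soften the backward-compatibility concern.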
For something like ::first-letter, I didn't really expect it to be
possible to compile that to XPath. Instead it would have to be
something like:
    def first_letter_selector(xpath_expr):
        def selector(el):
            result = xpath_expr(el)
            return result.text_range(0, 1)
        return selector
For representing ranges I'd also like some text range (I don't have any
immediate needs, so I'm personally in no rush here). But it's not
something that would have to replace the current text/tail attributes.
It's just that in some code it can be nice to have something similar to
the DOM TextNode, and this kind of provides that (except more nicely I
think, as it would be more like a view).
Then the range might just be like:
    class TextRange(object):
        def __init__(self, el, range, is_text):
            self.el = el
            assert range[0] >= 0
            self.range = range # (start, end) tuple
            # True: the range indexes el.text; False: it indexes el.tail
            self.is_text = is_text
        def __unicode__(self):
            start, end = self.range
            if end == 0:
                return ''
            # Look the text up on every call, so this acts as a view
            if self.is_text:
                t = self.el.text
            else:
                t = self.el.tail
            if t is None:
                raise ValueError(...)
            if not isinstance(t, unicode):
                t = unicode(t, 'utf8') #?
            if end > len(t):
                raise ValueError(
                    "TextRange %r is invalid (element has been changed?)"
                    % self)
            return t[start:end]
        def __repr__(self):
            if self.is_text:
                meth = 'text_range'
            else:
                meth = 'tail_range'
            return '%r.%s%s' % (self.el, meth, self.range)
        def getparent(self):
            return self.el
        # and other stuff that might be convenient...
I don't really want this kind of behavior anywhere I am currently using
text/tail.
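For anyone who wants to try the view idea, below is a minimal runnable sketch in Python 3, using the standard library's ElementTree rather than lxml. It is an assumption-laden variant of the class above, not a proposed API: start and end are passed separately, a missing text or tail is treated as an empty string, and first_letter is a hypothetical stand-in for a ::first-letter selector.

```python
import xml.etree.ElementTree as ET


class TextRange:
    """A view onto a slice of an element's text or tail (Python 3 sketch)."""

    def __init__(self, el, start, end, is_text=True):
        if not 0 <= start <= end:
            raise ValueError("invalid range (%r, %r)" % (start, end))
        self.el = el
        self.start = start
        self.end = end
        self.is_text = is_text

    def __str__(self):
        # Read the element on every call, so the view reflects later edits.
        t = self.el.text if self.is_text else self.el.tail
        if t is None:
            t = ""
        if self.end > len(t):
            raise ValueError("range no longer fits (element has been changed?)")
        return t[self.start:self.end]

    def getparent(self):
        return self.el


def first_letter(el):
    """Hypothetical stand-in for a ::first-letter selector."""
    return TextRange(el, 0, 1)


p = ET.fromstring("<p>Hello <b>big</b> world</p>")
letter = first_letter(p)
tail = TextRange(p.find("b"), 1, 6, is_text=False)
print(str(letter), str(tail))
```

Since the range is resolved lazily, changing p.text afterwards changes what str(letter) returns, and shrinking the text below the range makes it raise instead of silently returning something stale.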
--
Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org