[lxml-dev] cssselect and cssutils

Ian Bicking ianb at colorstudy.com
Mon Jan 7 18:50:32 CET 2008


Höke, Christof wrote:
>> ::first-letter is hard because it doesn't match any object in lxml.
>> If it returned a string like "A" it would be very much out of
>> context (e.g., no parent pointer), and it would be hard to do
>> anything useful with it.  To make it useful I think it would
>> require some new stringish object that also looked nodeish (e.g.,
>> had a .getparent() method). Though maybe an object like that should
>> exist; something similar would be needed for representing ranges.
> 
> What came to my mind was the DOM range spec stuff, but it is not
> really finished, is it? I was reading about it some years (!) ago I
> think in the Javascript Definitive Guide but I think it never went
> anywhere really.

It doesn't really matter too much, since lxml isn't that much like the 
DOM.  But the same use cases for the DOM range can apply to lxml.

> :first-letter should actually be element.text[0] I guess (which would
> be a string in lxml currently?), I don't really know the lxml API but
> would it be possible to define a subtype for element.text for this
> case? But you are right, a more general approach would certainly be
> better.

element.text is just a unicode string.  Maybe we could have a method 
like element.text_range(0, 1) that returns a subclass of unicode that 
also happens to know something about its location.  E.g.:

class ElementText(unicode):
     def __new__(cls, text, is_tail, range, parent):
         self = unicode.__new__(cls, text)
         self.is_tail = is_tail
         self.range = range
         self._parent = parent
         # __new__ must return the new instance, or the call yields None
         return self

     def getparent(self):
         return self._parent

     def enclose_in_tag(self, el):
         """
         Enclose this text range in an element, like::

             span = Element('span')
             el.text_range(0, 1).enclose_in_tag(span)
         """
         parent = self.getparent()
         el.text = unicode(self)
         if self.is_tail:
             el.tail = parent.tail[self.range[1]:]
             parent.tail = parent.tail[:self.range[0]]
             index = parent.getparent().index(parent)
             parent.getparent().insert(index+1, el)
         else:
             el.tail = parent.text[self.range[1]:]
             parent.text = parent.text[:self.range[0]]
             parent.insert(0, el)
         self._parent = el
         self.range = (0, len(self))
         self.is_tail = False

All untested, of course.  A real sense of a range would be a bit more 
difficult, as it involves lots of partial elements.  But something like 
this would be necessary to do that work.
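
For anyone who wants to poke at the text (non-tail) branch without the 
class, here is a minimal standalone sketch.  It assumes Python 3 and uses 
the standard library's xml.etree.ElementTree as a stand-in for lxml (both 
expose .text, .tail and insert()); enclose_text_range is a hypothetical 
helper name, not an lxml API.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree


def enclose_text_range(parent, start, stop, el):
    """Wrap parent.text[start:stop] in el (hypothetical helper, .text case only)."""
    text = parent.text or ""
    el.text = text[start:stop]    # the enclosed characters move into el
    el.tail = text[stop:]         # whatever followed the range now trails el
    parent.text = text[:start]    # whatever preceded the range stays in parent
    parent.insert(0, el)          # el becomes the first child of parent
    return el


p = ET.fromstring("<p>A long time ago</p>")
span = ET.Element("span")
enclose_text_range(p, 0, 1, span)
print(ET.tostring(p, encoding="unicode"))
```

This should print <p><span>A</span> long time ago</p>, which is roughly 
what a ::first-letter match would need in order to be styled.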

Upon further thought, maybe subclassing unicode isn't the right thing; 
perhaps it should really just wrap a string.  Then you could just do, 
say, el.text_range[:5], where el.text_range is a range object for all 
of the element's text, and you could slice range objects to break them 
down further.  But dealing with changes to the element is tricky. 
It's all a bit tricky ;)
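
A rough sketch of the wrapping idea, assuming Python 3 and again using 
xml.etree.ElementTree as a stand-in for lxml.  TextRange and its 
attributes are hypothetical names, nothing here exists in lxml.  The 
range stores only an element reference and offsets, and reads the text 
from the element each time it is asked.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree


class TextRange:
    """Hypothetical wrapper around a slice of an element's .text or .tail."""

    def __init__(self, element, is_tail=False, start=0, stop=None):
        self.element = element
        self.is_tail = is_tail
        # Clamp the offsets to the text as it is right now.
        full = self._full_text()
        self.start, self.stop, _ = slice(start, stop).indices(len(full))

    def _full_text(self):
        text = self.element.tail if self.is_tail else self.element.text
        return text or ""

    def __str__(self):
        # Read through to the element, so the string is never a stale copy.
        return self._full_text()[self.start:self.stop]

    def __len__(self):
        return max(0, self.stop - self.start)

    def __getitem__(self, key):
        # Slicing a range yields a smaller range over the same element.
        if not isinstance(key, slice):
            raise TypeError("TextRange supports slicing only")
        if key.step not in (None, 1):
            raise ValueError("TextRange slices must be contiguous")
        start, stop, _ = key.indices(len(self))
        return TextRange(self.element, self.is_tail,
                         self.start + start, self.start + stop)


el = ET.fromstring("<p>Hello world<b>!</b> tail</p>")
r = TextRange(el)
first_word = r[:5]
print(str(first_word))
```

Reading through to the element keeps the text current, but the offsets 
are not adjusted when the element changes (inserting a character at the 
front shifts what r[:5] covers), which is the tricky part.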

-- 
Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org

