[lxml-dev] cssselect and cssutils

Ian Bicking ianb at colorstudy.com
Mon Jan 7 22:15:59 CET 2008


Stefan Behnel wrote:
> Hi Ian,
> 
> Ian Bicking wrote:
>> element.text is just a unicode string.
> 
> or a plain string.
> 
> 
>> Maybe we could have a method
>> like element.text_range(0, 1) that returns a subclass of unicode that
>> also happens to know something about its location.
> 
> I prefer having the XPath string results be something like that. I think
> that's the only case where you can 'spuriously' end up with a text value and
> might want to know where it came from.
> 
> 
>> class ElementText(unicode):
> 
> Maybe we should still keep up the str/unicode duality here. Although that will
> be history with Python 3, it isn't now, and it is an integral part of the
> current lxml API.

ElementUnicodeText and ElementStrText?  Not very pretty :-P

> 
>>     def __new__(cls, text, is_tail, range, parent):
>>         self = unicode.__new__(cls, text)
>>         self.is_tail = is_tail
> 
> Right, 'is_tail' should be in.
> 
> 
>>         self.range = range
> 
> 'range' would be the substring indices? I would prefer calculating as much as
> possible on demand. Remember, most people will not use this object in any
> other way than a plain string. That's why I'm so hesitant about instantiating
> an Element object along the road.

Range is the slice that is selected, which is necessary for manipulation 
later (like enclose_in_tag).

I'm not proposing this in any way replace text and tail.  These don't
feel quite like strings to me.  Strings are interchangeable and simple.
These are located in a specific place.  If you call text.capitalize(),
what does that do?  Give you a capitalized view on the text?  Give you a
new string that loses all sense of place?  It doesn't feel like a string
at all, which is why I'm not sure it should even subclass from
unicode/str.  Or, for that matter, get used in lots of different contexts.
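
To make the capitalize() question concrete, here is a minimal sketch.  It
assumes Python 3 (where str plays the role of unicode) and a hypothetical
LocatedText class that is not part of lxml.  It only shows that inherited
string methods hand back plain strings, so the location is lost as soon
as the text is transformed.

```python
class LocatedText(str):
    """Hypothetical str subclass that remembers where its text came from."""

    def __new__(cls, text, is_tail, parent):
        self = super().__new__(cls, text)
        self.is_tail = is_tail
        self.parent = parent  # stand-in for the owning element
        return self


located = LocatedText("hello", is_tail=False, parent="<p>")
capitalized = located.capitalize()

print(type(located).__name__)          # LocatedText
print(type(capitalized).__name__)      # str
print(hasattr(capitalized, "parent"))  # False
```

Keeping the location through such calls would mean overriding every
string method, which is part of why subclassing feels awkward here.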

Maybe using this for XPath return values makes it important that it be
fast.  I'm not sure.  It doesn't seem that important to me that XPath
return values be particularly light.  And I've found it problematic
sometimes that non-node XPath return values are just strings (though
that's been more an issue of attributes, where I'd like to know what
attribute or element the value belonged to).  OTOH, something like
XPath's string() really is a string without any place.  So it's all
just kind of eclectic and awkward.

> 
>>     def enclose_in_tag(self, el):
>>         """
>>         Enclose this text range in an element, like::
>>
>>             span = Element('span')
>>             el.text_range(0, 1).enclose_in_tag(span)
>>         """
> 
> Hmm, I'll have to think about that one. Not sure what the exact semantics
> should be.

It occurred to me while thinking about how you could actually do
something useful with ::first-letter, like:

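def apply_style(doc, selector, style):
    for item in selector(doc):
        if isinstance(item, ElementText):
            # Text matches (e.g. ::first-letter) get wrapped in a span
            # so there is an element to hang the style on.
            el = Element('span')
            item.enclose_in_tag(el)
            item = el
        # Only add the separator when there is an existing style.
        old_style = item.get('style')
        item.set('style', old_style + '; ' + style if old_style else style)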

This and the only other use case I currently have in my head for ranges 
(highlighting a selection of the document) would use something like 
enclose_in_tag.
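
As a rough sketch of what enclose_in_tag might do for a range of
element.text, assuming one possible semantics (nothing here is settled):
the selected slice becomes the wrapper's text and whatever follows it
becomes the wrapper's tail.  The helper name enclose_text_range is
made up, it ignores the tail case, and it uses the standard library's
xml.etree.ElementTree as a stand-in for lxml so it runs on its own.

```python
import xml.etree.ElementTree as ET


def enclose_text_range(parent, start, end, wrapper):
    """Wrap parent.text[start:end] in `wrapper`, inserted as the first child.

    Hypothetical helper: handles only .text, not .tail.
    """
    text = parent.text or ""
    parent.text = text[:start] or None   # text before the range stays put
    wrapper.text = text[start:end]       # the selected slice
    wrapper.tail = text[end:] or None    # the remainder follows the wrapper
    parent.insert(0, wrapper)
    return wrapper


p = ET.fromstring("<p>Hello <b>world</b></p>")
span = enclose_text_range(p, 0, 1, ET.Element("span"))
span.set("style", "font-size: 200%")
print(ET.tostring(p, encoding="unicode"))
# <p><span style="font-size: 200%">H</span>ello <b>world</b></p>
```

The tail case would need the owning element's parent as well, which is
one reason the range object would have to carry more than the indices.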

I can't remember what I was doing when I wanted XPath attributes, except 
I think it was matching something like @*, where the attribute name 
mattered but I didn't want to query on it.  I think I ended up selecting 
elements and looping through the attributes instead.  Maybe this was in 
some iteration of the HTML cleaning code.
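
The workaround of selecting elements and looping through their
attributes looks roughly like this.  Again a sketch only: iter_attributes
is a hypothetical name, and it uses xml.etree.ElementTree rather than
lxml so that it is self-contained.

```python
import xml.etree.ElementTree as ET


def iter_attributes(root):
    """Yield (element, name, value) for every attribute in the tree."""
    for el in root.iter():
        for name, value in el.attrib.items():
            yield el, name, value


doc = ET.fromstring('<div id="a"><a href="/x" onclick="go()">link</a></div>')
found = [(el.tag, name, value) for el, name, value in iter_attributes(doc)]
print(found)
```

Unlike a bare string from @*, each value here arrives together with its
attribute name and the element it belongs to.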


-- 
Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org

