[lxml-dev] Some XPath questions...

Ian Bicking ianb at colorstudy.com
Fri Jun 29 23:30:11 CEST 2007


Thanks, very helpful.  I'm guessing it was an oversight that you didn't 
copy the list...

Mike Meyer wrote:
> In <468538ED.9060004 at colorstudy.com>, Ian Bicking <ianb at colorstudy.com> typed:
>> I'm trying to implement CSS selectors, by translating them into XPath. 
>> There's some CSS expressions that I'm having a hard time with, so maybe 
>> someone can tell me how they might work.
>>
>> Expression:
>>
>> div:first-child -- means a div element when it is the first child of its 
>> parent.  I.e.:
>>
>>    <li>
>>      <div id="a">...</div>
>>      <div id="b">...</div>
>>    </li>
>>
>> It matches the first div and not the second.
>>
>> I thought this could be:
>>
>>    descendant-or-self::*/div[0]
>>    or... descendant-or-self::*/div[position() = 0]
> 
>> Those two should be equivalent; the second is a bit easier to handle 
>> programmatically.  But it doesn't work (doesn't match anything).
> 
> XPath positions are 1-indexed, not 0-indexed, so position() will never be
> 0. I understand some versions of IE get this wrong as well.
> 
> To pick out all the div elements that are the first child of their
> parent, use:
> 
> //*[position() = 1 and name() = 'div']
> 
> or equivalently:
> 
> descendant-or-self::*/*[position() = 1 and name() = 'div']

I don't know how I missed the fact they are 1-indexed... I guess it's 
become such an unusual choice these days.  But handy anyway, since CSS 
is also 1-indexed.
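
A minimal sketch of the 1-indexing point, using only the standard library so it runs without lxml. The helper names first_child_xpath and first_children are made up for illustration; the second is a plain-Python stand-in for what the predicate selects, not lxml's own evaluation of it.

```python
import xml.etree.ElementTree as ET

def first_child_xpath(tag):
    # XPath positions start at 1, so "first child" is position() = 1.
    return "descendant-or-self::*/*[position() = 1 and name() = '%s']" % tag

def first_children(root, tag):
    # Plain-Python stand-in: elements that are the first element child
    # of their parent and carry the requested tag.
    return [parent[0] for parent in root.iter()
            if len(parent) and parent[0].tag == tag]

doc = ET.fromstring(
    '<ul><li><div id="a"/><div id="b"/></li>'
    '<li><p/><div id="c"/></li></ul>'
)
expr = first_child_xpath("div")
ids = [el.get("id") for el in first_children(doc, "div")]
```

Only the div with id "a" qualifies: "b" is a second child, and "c" is preceded by a p element.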

>> Another expression:
>>
>> div.foo + div -- means a div element that is the immediately next 
>> sibling of a div element with the class .foo.  I would translate this to:
>>
>>    descendant-or-self::div[@class='foo']/following-sibling::div[0]
>>
>> (The class matching is actually a bit more complex, but it doesn't 
>> actually matter to this.)  I'm (a) not sure if this is right, because 
>> maybe it means the next div after the matching div, even if there's 
>> another element in-between, and (b) it doesn't return any results 
>> regardless.
> 
> 
> Your maybe is right - following-sibling::div selects every div that
> follows, whether or not other siblings come in between. You then select
> the first element from that list (or would, if you used a 1 instead of
> a 0). Same solution: the last bit is
> following-sibling::*[position() = 1 and name() = 'div']
> 
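
A small sketch of the "immediately next sibling" rule discussed above, written against the standard library only. The names adjacent_sibling_xpath and adjacent_siblings are hypothetical; the second function is a plain-Python check of the intended semantics, not something lxml provides.

```python
import xml.etree.ElementTree as ET

def adjacent_sibling_xpath(left_xpath, tag):
    # Take the first following sibling of any kind, then require its name.
    return ("%s/following-sibling::*[position() = 1 and name() = '%s']"
            % (left_xpath, tag))

def adjacent_siblings(root, left_tag, left_class, tag):
    # Plain-Python check: cur must directly follow a matching prev.
    found = []
    for parent in root.iter():
        children = list(parent)
        for prev, cur in zip(children, children[1:]):
            if (prev.tag == left_tag
                    and left_class in prev.get("class", "").split()
                    and cur.tag == tag):
                found.append(cur)
    return found

doc = ET.fromstring(
    '<body>'
    '<div class="foo" id="a"/><p/><div id="b"/>'
    '<div class="foo" id="c"/><div id="d"/>'
    '</body>'
)
expr = adjacent_sibling_xpath("descendant-or-self::div[@class='foo']", "div")
ids = [el.get("id") for el in adjacent_siblings(doc, "div", "foo", "div")]
```

Here "b" is not selected because a p element sits between it and the div with class foo, while "d" directly follows one.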
>> Another expression:
>>
>> div:contains('celia') -- means a div where the textual content has the 
>> word 'celia' in it, case insensitive.  At least, I think it's case 
>> insensitive -- the CSS spec is annoyingly vague, but implementations 
>> seem to work like this.  I translate this to:
>>
>>    descendant-or-self::div[contains(css:lower-case(string(.)), 'celia']
>>
>> I added the lower-case function like:
>>
>>    def _make_lower_case(context, s):
>>        return s.lower()
>>    etree.FunctionNamespace("css")['lower-case'] = _make_lower_case
>>
>> But XPath gives so few errors that it's hard to tell if it's really 
>> working.  The XPath expression returns some elements, but not the 
>> correct number from what I can tell.  Especially since when I had a bug 
>> and wasn't lowercasing the second argument (using 'CELIA') it still 
>> returned elements.
> 
> I think you've got the parens in the wrong place - the last close goes
> after 'celia', not the comma.

That was just a typo in the email; copying and pasting directly:

     >>> xpath('E:contains("foo")')
     e[contains(css:lower-case(string(.)), 'foo')]

However, now that I'm writing my own tests it seems fine (I was using 
someone else's tests, and I think they were wrong; though I'm not sure 
-- you'll always get all the ancestors of a matching element if you use 
that, since string(.) on an ancestor includes the text of all its 
descendants).
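
The ancestor effect can be seen in a rough standard-library sketch. The helper contains_text is a hypothetical name; it mimics contains(lower-case(string(.)), ...) by joining all descendant text, and it lowercases the search term too, which is an assumption about the intended case-insensitive behaviour.

```python
import xml.etree.ElementTree as ET

def contains_text(root, tag, needle):
    # "".join(el.itertext()) plays the role of string(.): the element's
    # own text plus the text of every descendant.
    needle = needle.lower()
    return [el for el in root.iter(tag)
            if needle in "".join(el.itertext()).lower()]

doc = ET.fromstring(
    '<div id="outer">'
    '<div id="inner">Hello Celia</div>'
    '<div id="other">nobody</div>'
    '</div>'
)
ids = [el.get("id") for el in contains_text(doc, "div", "CELIA")]
```

Both "outer" and "inner" come back, because the outer div's string value includes the inner div's text.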

>> There's some other tricky ones I'm not sure about either, though they 
>> seem to be kind of working.  Things like div:only-child (when it's a div 
> 
> //*[name() = 'div' and last() = 1]

This doesn't seem to be working for me:

     >>> xpath('span:only-child')
     *[name() = 'span' and (last() = 1)]

But testing with <div><span></span></div> in the document, I don't get 
anything returned.

These all work now...

>> with no siblings),
> 
>> div:last-child (no next sibling)
> 
> //*[name() = 'div' and position() = last()]
> 
>> div:first-child (no previous sibling)
> 
> Didn't we just cover that one?
> 
>> div:first-of-type (no preceding siblings that are divs)
> 
> //div[position() = 1]
> 
>> div:last-of-type (no following siblings that are divs), 
> 
> //div[position() = last()]
> 
>> div:only-of-type (you are probably getting the pattern)
> 
> //div[last() = 1]
> 
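
The of-type variants can be tried without lxml, since the standard library's ElementTree supports a limited XPath subset that includes the [1] and [last()] index predicates, counted among same-tag siblings. This is only a sketch: ElementTree does not accept [last() = 1], so only-of-type is approximated here by intersecting the other two results.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<body><p/><div id="a"/><div id="b"/>'
    '<section><div id="c"/></section></body>'
)

# div:first-of-type and div:last-of-type via index predicates.
first_of_type = doc.findall(".//div[1]")
last_of_type = doc.findall(".//div[last()]")

# div:only-of-type: both the first and the last div under its parent.
only_of_type = [el for el in first_of_type if el in last_of_type]

first_ids = [el.get("id") for el in first_of_type]
last_ids = [el.get("id") for el in last_of_type]
only_ids = [el.get("id") for el in only_of_type]
```

The leading p element does not stop "a" from being the first div of its parent, which is the difference between first-of-type and first-child.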
>> div:empty (no children, including text, maybe not including whitespace).
> 
> Ouch. Let me think about that one.

Yeah, I couldn't figure that one out.  I thought this might work:

     >>> xpath('E:empty')
     e[count(./children::*) = 0 and string(.) = '']

But maybe I don't understand how count() works; this isn't a valid XPath 
expression.
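
The expression above is rejected because XPath has no "children" axis; the axis is named "child". A possible corrected form is sketched here, together with a plain-Python check of the same condition. The names empty_xpath and is_empty are hypothetical, and whether whitespace-only text should count as empty is left as a flag, since that point is open in the discussion.

```python
import xml.etree.ElementTree as ET

def empty_xpath(tag):
    # child::* is the valid spelling of the axis; count() then works.
    return ("descendant-or-self::%s[count(child::*) = 0 and string(.) = '']"
            % tag)

def is_empty(el, ignore_whitespace=False):
    # No element children and no text (optionally ignoring whitespace).
    text = el.text or ""
    if ignore_whitespace:
        text = text.strip()
    return len(el) == 0 and text == ""

doc = ET.fromstring(
    '<r><e id="a"/><e id="b"> </e><e id="c">x</e>'
    '<e id="d"><f/></e></r>'
)
expr = empty_xpath("e")
strict_ids = [el.get("id") for el in doc.iter("e") if is_empty(el)]
loose_ids = [el.get("id") for el in doc.iter("e")
             if is_empty(el, ignore_whitespace=True)]
```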

>> There's also 
>> div:nth-child(matcher) and div:nth-of-type(matcher), which selects among 
> 
> Those should be easy with the above examples.
> 
>> siblings with patterns like "2" (second sibling), "3n" (every third 
>> element), "odd" (odd elements) and some other selections.  I kind of see 
>> how to deal with this using position(), but I'm not sure how to do 
>> either nth-of-type or nth-child (and the ones I do understand I am also 
>> vague about).
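
One way to approach the nth patterns is to reduce each to a pair (a, b) meaning "position = a*n + b for some n >= 0": "2" is (0, 2), "3n" is (3, 0), and "odd" is (2, 1). The sketch below is an assumption about how a translator could emit a position() test from such a pair, with hypothetical function names. Placed as *[name() = 'div' and ...] it would count among all element siblings (nth-child); placed as div[...] it would count among div siblings only (nth-of-type).

```python
def nth_predicate(a, b):
    # Build the XPath test that would go inside the [...] predicate.
    if a == 0:
        return "position() = %d" % b
    bound = ">=" if a > 0 else "<="
    return ("(position() - %d) mod %d = 0 and position() %s %d"
            % (b, a, bound, b))

def nth_matches(position, a, b):
    # Plain-Python version of the same rule, handy for testing.
    if a == 0:
        return position == b
    n, remainder = divmod(position - b, a)
    return remainder == 0 and n >= 0

odd = [p for p in range(1, 8) if nth_matches(p, 2, 1)]
every_third = [p for p in range(1, 8) if nth_matches(p, 3, 0)]
first_three = [p for p in range(1, 8) if nth_matches(p, -1, 3)]
```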


-- 
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org
             | Write code, do good | http://topp.openplans.org/careers

