[lxml-dev] Some XPath questions...

Ian Bicking ianb at colorstudy.com
Mon Jul 2 19:21:54 CEST 2007


Stefan Behnel wrote:
> Hi Ian,
> 
> just to comment on your actual first post in this thread, which I kinda
> overlooked because of the later discussion.
> 
> I think this is pretty cool stuff and I'd love to have this in lxml. The html
> module really seems to be getting somewhere. I think we shouldn't even wait
> too long with a release so that we get some more feedback on the new APIs.
> Maybe I should fix lxml's versioning so that we can put out a 2.0alpha1 (and
> not only alpha, beta, final).

Yeah, I was thinking about writing up a summary of things that need to 
be done in the html package; there's still some outstanding stuff, but 
not too much.  The clean module needs to be cleaned up (I'm thinking of 
moving from a function to a class).  I'd like to make the usedoctest 
hack a little more general, as elsewhere I'm now using a similar hack to 
enable ELLIPSIS, and I'd like them not to conflict.  And then some docs, 
but I guess that's it.

> Ian Bicking wrote:
>> div:contains('celia') -- means a div where the textual content has the 
>> word 'celia' in it, case insensitive.  At least, I think it's case 
>> insensitive -- the CSS spec is annoyingly vague, but implementations 
>> seem to work like this.  I translate this to:
>>
>>    descendant-or-self::div[contains(css:lower-case(string(.)), 'celia')]
>>
>> I added the lower-case function like:
>>
>>    def _make_lower_case(context, s):
>>        return s.lower()
>>    etree.FunctionNamespace("css")['lower-case'] = _make_lower_case
> 
> "css" is not the namespace, it's the prefix. You can do this:
> 
>    ns = etree.FunctionNamespace("http://my/css/namespace")
>    ns.prefix = "css"
>    ns['lower-case'] = _make_lower_case

OK, I've switched to this.
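
A minimal sketch of that registration, for reference.  The namespace
URI is the placeholder from the thread, and the sample document is
made up for illustration:

```python
from lxml import etree

def _make_lower_case(context, s):
    # XPath extension functions receive the evaluation context first.
    return s.lower()

# Placeholder URI taken from the thread; any unique URI would do.
ns = etree.FunctionNamespace("http://my/css/namespace")
ns.prefix = "css"   # global prefix, usable in any XPath expression
ns["lower-case"] = _make_lower_case

# Made-up sample document.
root = etree.fromstring(
    "<root><div>Hello Celia</div><div>nobody here</div></root>")
matches = root.xpath(
    "descendant-or-self::div[contains(css:lower-case(string(.)), 'celia')]")
```

With the prefix set on the FunctionNamespace, the expression needs no
per-call namespace map.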

> or this:
> 
>    ns = etree.FunctionNamespace("http://my/css/namespace")
>    ns['lower-case'] = _make_lower_case
> 
>    def css_to_xpath(css):
>        xpath = build_xpath(css)
>        return etree.XPath(xpath, namespaces={'css' : "http://my/css/namespace"})

Is there any advantage to this, over a more global prefix?  I suppose 
there's a possible collision of css:, but I doubt that will be a problem.
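
For comparison, a sketch of the per-expression variant, where the
prefix is bound only for one compiled XPath object.  The URI is again
the placeholder from the thread; find_celia and the sample document
are made-up names for illustration:

```python
from lxml import etree

CSS_NS = "http://my/css/namespace"  # placeholder URI from the thread

def _make_lower_case(context, s):
    return s.lower()

# Register the function under the URI, without a global prefix.
etree.FunctionNamespace(CSS_NS)["lower-case"] = _make_lower_case

# The prefix "css" is only known to this compiled expression.
find_celia = etree.XPath(
    "descendant-or-self::div[contains(css:lower-case(string(.)), 'celia')]",
    namespaces={"css": CSS_NS},
)

root = etree.fromstring("<root><div>CELIA</div><div>other</div></root>")
matches = find_celia(root)
```

The trade-off is locality: nothing else in the process can collide
with the prefix, at the cost of passing the map on every compile.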

> You should consider providing a default namespace map here, and maybe even
> return compiled XPath objects, i.e. callables. Note that these provide a
> "path" attribute that returns the original path, so if you have to extend an
> expression later on, you can still do so by creating a new XPath object.

That's handy.  I was thinking of creating a CSSXPath subclass or 
something that would keep the original CSS selector around, in addition 
to the translated XPath.
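
Roughly what I have in mind, as a sketch only.  The translate()
function here is a hypothetical stand-in that handles nothing but bare
tag names; the real translator would replace it:

```python
from lxml import etree

def translate(css):
    # Hypothetical stand-in for the real CSS-to-XPath translator:
    # it only understands a bare tag name such as "div".
    if not css.isalnum():
        raise ValueError("unsupported selector: %r" % css)
    return "descendant-or-self::" + css

class CSSXPath(etree.XPath):
    """Compiled XPath that remembers the CSS selector it came from."""

    def __init__(self, css):
        etree.XPath.__init__(self, translate(css))
        self.css = css  # original selector; .path holds the XPath

sel = CSSXPath("div")
root = etree.fromstring("<r><div/><p><div/></p></r>")
found = sel(root)
```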

> Note that this would also allow you to wrap the function with an additional
> call to set(), so that or-ed results really become the union and not the sum
> of all parts.

If you use | in the XPath expression it seems to work out that there 
won't be any duplicates.
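
A small check of that behaviour, using a made-up document in which one
element matches both branches of the union:

```python
from lxml import etree

# Made-up document: the first div matches both "//div" and the
# class test, so it is the candidate for duplication.
root = etree.fromstring(
    '<root><div class="a"/><div/><p class="a"/></root>')

# One expression with "|": XPath node-sets contain each node once.
union = root.xpath("//div | //*[@class='a']")

# Two separate queries concatenated in Python: the shared div repeats.
concatenated = root.xpath("//div") + root.xpath("//*[@class='a']")
```

So the set() wrapper would only matter if the parts were run as
separate expressions and joined afterwards.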

>> But XPath gives so few errors that it's hard to tell if it's really 
>> working.
> 
> Sadly, there doesn't seem to be a simple way to find out that a function was
> undeclared. Or maybe I'll just have to look back into that... didn't I do that
> already? :)

We talked about it previously when I was trying to use match(), and 
instead of errors got bizarre results.  But I don't think it resulted in 
any improvements on error messages.

>> There's also
>> div:nth-child(matcher) and div:nth-of-type(matcher), which selects among 
>> siblings with patterns like "2" (second sibling), "3n" (every third 
>> element), "odd" (odd elements) and some other selections.  I kind of see 
>> how to deal with this using position(), but I'm not sure how to do 
>> either nth-of-type or nth-child (and the ones I do understand I am also 
>> vague about).
> 
> If I understand this correctly, this would be
> 
>   nth-of-type: //*/NAME[position() = x]
>   nth-child:   //*/*[position() = x]
> 
> To deal with things like "2n", try this:
> 
>     //*/NAME[(position() mod 2) = 0]

I think I already have all this working now... though I wish there was a 
test case I could use, as I'm not 100% sure that my tests are testing 
for the correct results.
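
For what it's worth, here is a sketch of how the "an+b" patterns can be
turned into position() predicates.  This is one reading of the CSS
selector rules, not necessarily what the committed code does, and the
function names are made up:

```python
import re

_AN_B = re.compile(r"^([+-]?\d*)n(?:([+-])(\d+))?$")

def parse_nth(expr):
    """Parse 'odd', 'even', '2', '3n', '2n+1', '-n+3' into (a, b)."""
    expr = expr.replace(" ", "").lower()
    if expr == "odd":
        return 2, 1
    if expr == "even":
        return 2, 0
    if re.match(r"^[+-]?\d+$", expr):
        return 0, int(expr)
    m = _AN_B.match(expr)
    if m is None:
        raise ValueError("unsupported nth expression: %r" % expr)
    a_text, sign, b_text = m.groups()
    a = {"": 1, "+": 1, "-": -1}.get(a_text)
    if a is None:
        a = int(a_text)
    b = int(b_text) if b_text else 0
    return a, (-b if sign == "-" else b)

def _offset(b):
    # position() shifted by b, written without a double sign.
    if b == 0:
        return "position()"
    return "(position() %s %d)" % ("-" if b > 0 else "+", abs(b))

def nth_predicate(a, b):
    """XPath predicate matching positions a*n + b for n = 0, 1, 2, ..."""
    if a == 0:
        return "position() = %d" % b
    if a > 0:
        pred = "%s mod %d = 0" % (_offset(b), a)
        return pred + (" and position() >= %d" % b if b > 0 else "")
    return "%s mod %d = 0 and position() <= %d" % (_offset(b), -a, b)

def nth_child(name, expr):
    # Position counts all sibling elements, then the name is checked.
    return "*[%s][self::%s]" % (nth_predicate(*parse_nth(expr)), name)

def nth_of_type(name, expr):
    # Position counts only siblings with the same name.
    return "%s[%s]" % (name, nth_predicate(*parse_nth(expr)))
```

The difference between the two is only where the name test sits:
nth_child("div", "2") gives *[position() = 2][self::div], while
nth_of_type("div", "2") gives div[position() = 2].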

>> I've committed the incomplete code in lxml.html.css
> 
> I skimmed through it a bit and found it really cool. I'm not completely
> satisfied with the naming, but I now see that the context of the css module
> makes it clearer what the semantics are. Still, I prefer "css_to_xpath()", and
> providing a top-level class XPath() makes me think it should return an
> etree.XPath object, i.e. a compiled path.

I was thinking about changing around all the public naming.  I'd like 
for it to be a method on elements, though I'm not sure what to call the 
method.  .css(expr) is a bit funny, as it's not "css", it's just a css 
selector.  .select(expr) doesn't say what kind of selector you are using.

Another public function would be like XPath, something that compiles the 
entire CSS expression.  Especially since the CSS parsing is non-trivial 
(just like the XPath parsing is non-trivial), precompiling will be 
beneficial.

I'm thinking of also adding a fast path for a couple common kinds of 
selectors, that translate them more quickly into XPath.  E.g., search 
for r'^\.(\w+)' for class name matches, or '^#(\w+)' for id matches, 
etc.  And there's the question about whether simple CSS selectors should 
be translated to XPath at all (especially when they aren't precompiled). 
  For people that are familiar with CSS selectors, it seems entirely 
possible that they will use it for very simple queries, like 
el.css('div').  If I detect that case and turn it into el.findall('div') 
then it would be completely reasonable; but if it gets tokenized, 
parsed, translated to XPath, compiled, then run, then that's going to be 
pretty inefficient.
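
A sketch of what such a fast path might look like.  The function name
and return convention are made up, and the patterns are anchored at
both ends (an assumption on my part) so that compound selectors fall
through to the full parser:

```python
import re

_CLASS_RE = re.compile(r"\.(\w+)")
_ID_RE = re.compile(r"#(\w+)")
_TAG_RE = re.compile(r"[A-Za-z]\w*")

def fast_path(selector):
    """Return ('xpath', expr), ('findall', path), or None.

    None means: not a trivial selector, use the full translator.
    """
    selector = selector.strip()
    m = _CLASS_RE.fullmatch(selector)
    if m:
        # Whitespace-separated class list, so pad and search for the word.
        return ("xpath",
                "descendant-or-self::*[contains(concat(' ', "
                "normalize-space(@class), ' '), ' %s ')]" % m.group(1))
    m = _ID_RE.fullmatch(selector)
    if m:
        return ("xpath", "descendant-or-self::*[@id = '%s']" % m.group(1))
    if _TAG_RE.fullmatch(selector):
        # Assumes descendants are wanted, hence ".//" rather than a
        # bare tag name (which would only look at direct children).
        return ("findall", ".//" + selector)
    return None
```

Anything that returns None goes through the tokenize/parse/translate
pipeline as before.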

Anyway, back to naming -- if there's a method and a function/object to 
compile expressions, that's all the public interface I think it needs. 
I don't think translating css to xpath without compiling is particularly 
important.

> One more note:
> 
>   def run_xpath(doc, xpath):
>       return [el for el in doc.xpath(xpath)
>               if isinstance(el, etree.ElementBase)]
> 
> Do you mean "etree.iselement(el)" here or are you intentionally restricting
> this to real-element subclasses of _Element? (i.e. no plain lxml.etree
> elements, no PIs, no comments)

I wasn't aware of iselement().

I'm not actually sure this is even necessary; I'm not sure if I can ever 
match non-elements with the expressions at all.  I think I put it in 
there at some point when I wasn't sure.  Instead it should probably be 
an assertion in the tests.
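
A sketch of the difference between the two checks, on a made-up
document parsed with plain lxml.etree:

```python
from lxml import etree

root = etree.fromstring("<root><p>text</p></root>")

# Mix element results with text results, as a stray text() step would.
results = root.xpath("//p") + root.xpath("//p/text()")

# iselement() keeps elements and drops the string results.
elements = [r for r in results if etree.iselement(r)]

# ElementBase is the base for custom element classes; elements parsed
# by plain lxml.etree are not instances of it.
plain_is_base = isinstance(root, etree.ElementBase)
```

So the isinstance() test happens to work with lxml.html elements but
would filter out everything from a plain etree document.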

> I actually think this module merits its own top-level placing, not necessarily
> only as part of lxml.html. It could just as well become "lxml.css", and should
> thus not rely too much on a specific API from lxml.html.

Yes, you can do selections on anything.  CSS, it seems, uses | for 
namespaces, like "atom|title", and it doesn't know anything special 
about HTML (except for special handling of the class attribute).

Right now I'm assuming the XPath picks up the prefixes from elsewhere in 
the document.  CSS uses "@namespace prefix URI", but that's part of a 
CSS document, and we're only handling selectors.  So I just translate 
"atom|title" to "//atom:title", and assume it'll work.
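
The translation itself is tiny; a sketch, with a made-up function name
and covering only a single prefix|name type selector:

```python
import re

# Only a lone "prefix|name" type selector; attribute operators such
# as "|=" are deliberately not handled in this sketch.
_NS_TYPE = re.compile(r"^([A-Za-z_][\w-]*)\|([A-Za-z_][\w-]*)$")

def ns_selector_to_xpath(selector):
    m = _NS_TYPE.match(selector.strip())
    if m is None:
        raise ValueError("not a prefix|name selector: %r" % selector)
    prefix, name = m.groups()
    return "//%s:%s" % (prefix, name)
```

The prefix still has to be bound to a URI when the XPath is evaluated,
for example through a namespaces mapping passed to the evaluator.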

The CSS syntax does seem handier for a lot of kinds of selections, and 
after translating them I find the equivalent XPath rather complex in 
some cases (e.g., li:first-child).  So there's some benefit there.


-- 
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org
             | Write code, do good | http://topp.openplans.org/careers

