[lxml-dev] Some XPath questions...

Stefan Behnel stefan_ml at behnel.de
Mon Jul 2 10:32:36 CEST 2007


Hi Ian,

just a comment on your actual first post in this thread, which I kinda
overlooked because of the later discussion.

I think this is pretty cool stuff and I'd love to have this in lxml. The html
module really seems to be getting somewhere. I think we shouldn't even wait
too long with a release so that we get some more feedback on the new APIs.
Maybe I should fix lxml's versioning so that we can put out a 2.0alpha1 (and
not only alpha, beta, final).


Ian Bicking wrote:
> div:contains('celia') -- means a div where the textual content has the 
> word 'celia' in it, case insensitive.  At least, I think it's case 
> insensitive -- the CSS spec is annoyingly vague, but implementations 
> seem to work like this.  I translate this to:
> 
>    descendant-or-self::div[contains(css:lower-case(string(.)), 'celia')]
> 
> I added the lower-case function like:
> 
>    def _make_lower_case(context, s):
>        return s.lower()
>    etree.FunctionNamespace("css")['lower-case'] = _make_lower_case

"css" is not the namespace, it's the prefix. You can do this:

   ns = etree.FunctionNamespace("http://my/css/namespace")
   ns.prefix = "css"
   ns['lower-case'] = _make_lower_case

or this:

   ns = etree.FunctionNamespace("http://my/css/namespace")
   ns['lower-case'] = _make_lower_case

   def css_to_xpath(css):
       xpath = build_xpath(css)
       return etree.XPath(xpath, namespaces={'css': "http://my/css/namespace"})

You should consider providing a default namespace map here, and maybe even
return compiled XPath objects, i.e. callables. Note that these provide a
"path" attribute that returns the original path, so if you have to extend an
expression later on, you can still do so by creating a new XPath object.
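
To make the second variant concrete, here is a minimal sketch of a compiled,
callable XPath with a default prefix map. The helper name css_contains() and
the NSMAP constant are made up for illustration, and the namespace URI is only
the placeholder used above. It assumes the search word contains no quote
characters.

```python
from lxml import etree

CSS_NS = "http://my/css/namespace"   # placeholder URI, not an official one
NSMAP = {'css': CSS_NS}              # default prefix map (hypothetical name)

def _make_lower_case(context, s):
    # 's' is the string value produced by string(.) in the expression
    return s.lower()

# Register the function under the namespace URI, not under the prefix
ns = etree.FunctionNamespace(CSS_NS)
ns['lower-case'] = _make_lower_case

def css_contains(tag, word):
    # Hypothetical helper: build and compile the XPath for tag:contains(word).
    # The word is pasted into a quoted literal, so it must not contain quotes.
    xpath = ("descendant-or-self::%s[contains(css:lower-case(string(.)), '%s')]"
             % (tag, word.lower()))
    return etree.XPath(xpath, namespaces=NSMAP)

doc = etree.fromstring("<body><div>Hello Celia</div><div>nobody</div></body>")
find = css_contains("div", "celia")
matches = find(doc)                  # compiled XPath objects are callables

# The original expression is still available, so it can be extended later
extended = etree.XPath(find.path + "/text()", namespaces=NSMAP)
```

Calling find(doc) returns the matching elements, and find.path gives back the
expression string that was compiled.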

Note that this would also allow you to wrap the function with an additional
call to set(), so that or-ed results really become the union and not the sum
of all parts.
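
A rough sketch of that wrapping idea, assuming each part of an or-ed selector
has been compiled into its own XPath object. The function name union_xpath()
is hypothetical. Instead of a bare set() it keeps a "seen" set next to a list,
which is a small variation so that the result keeps the order in which
elements were first found.

```python
from lxml import etree

def union_xpath(*paths):
    # Compile each alternative once, up front
    compiled = [etree.XPath(p) for p in paths]

    def evaluate(doc):
        seen = set()
        result = []
        for xpath in compiled:
            for el in xpath(doc):
                # Skip elements that an earlier alternative already returned
                if el not in seen:
                    seen.add(el)
                    result.append(el)
        return result

    return evaluate

doc = etree.fromstring('<body><div class="a">x</div><p class="a">y</p></body>')
select = union_xpath("//div", "//*[@class='a']")
elements = select(doc)
```

Here the div matches both alternatives but shows up only once in the result.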


> But XPath gives so few errors that it's hard to tell if it's really 
> working.

Sadly, there doesn't seem to be a simple way to find out that a function was
undeclared. Or maybe I'll just have to look back into that... didn't I do that
already? :)
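
One way to check that a registered function is really being called, sketched
under the assumption that the function is registered as above: evaluate it
directly on a string literal and look at the result, without any document
content involved. The variable name "probe" is just for illustration.

```python
from lxml import etree

CSS_NS = "http://my/css/namespace"   # placeholder URI from the example above

def _make_lower_case(context, s):
    return s.lower()

etree.FunctionNamespace(CSS_NS)['lower-case'] = _make_lower_case

# Call the extension function on a literal; the element is only a dummy
# context node that the XPath evaluator needs.
probe = etree.XPath("css:lower-case('CeLiA')", namespaces={'css': CSS_NS})
result = probe(etree.Element("dummy"))
```

If the function is wired up correctly, the result is the lower-cased literal.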


> There's also
> div:nth-child(matcher) and div:nth-of-type(matcher), which selects among 
> siblings with patterns like "2" (second sibling), "3n" (every third 
> element), "odd" (odd elements) and some other selections.  I kind of see 
> how to deal with this using position(), but I'm not sure how to do 
> either nth-of-type or nth-child (and the ones I do understand I am also 
> vague about).

If I understand this correctly, this would be

  nth-of-type: //*/NAME[position() = x]
  nth-child:   //*/*[position() = x]

To deal with things like "2n", try this:

    //*/NAME[(position() mod 2) = 0]
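
The same idea can be generalised to the other matcher patterns. The following
is only a sketch: the function names are hypothetical, negative steps such as
"-n+3" are deliberately left out, and it only builds the expression strings.
Note one assumption beyond the paths above: for a selector that names a tag,
like div:nth-child(2), the position has to be counted among all children
first, so the name test is appended as a second predicate.

```python
import re

# "an+b" forms such as "3n", "2n+1", "n+3", "2n-1"
_AN_PLUS_B = re.compile(r'^([+-]?\d*)n(?:([+-])(\d+))?$')

def nth_predicate(matcher):
    """Translate an nth-* argument into the body of an XPath 1.0 predicate."""
    text = matcher.replace(' ', '').lower()
    if text == 'odd':
        a, b = 2, 1
    elif text == 'even':
        a, b = 2, 0
    elif re.fullmatch(r'[+-]?\d+', text):
        a, b = 0, int(text)
    else:
        m = _AN_PLUS_B.match(text)
        if m is None:
            raise ValueError("unsupported nth pattern: %r" % matcher)
        coeff, sign, offset = m.groups()
        a = int(coeff + '1') if coeff in ('', '+', '-') else int(coeff)
        b = int(sign + offset) if sign else 0
    if a == 0:
        return "position() = %d" % b
    if a < 0:
        raise ValueError("negative steps are left out of this sketch")
    if b == 0:
        return "(position() mod %d) = 0" % a
    if b > 0:
        # positions b, b+a, b+2a, ...
        return "position() >= %d and ((position() - %d) mod %d) = 0" % (b, b, a)
    # negative offset: shift the position up before taking the remainder
    return "((position() + %d) mod %d) = 0" % (-b, a)

def nth_of_type(name, matcher):
    # position() counts only siblings called NAME
    return "//*/%s[%s]" % (name, nth_predicate(matcher))

def nth_child(name, matcher):
    # position() counts all element siblings, then the name is tested
    return "//*/*[%s][self::%s]" % (nth_predicate(matcher), name)
```

For example, nth_of_type("div", "2n") gives back the "mod 2" expression shown
above with NAME replaced by div.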


> I've committed the incomplete code in lxml.html.css

I skimmed through it a bit and found it really cool. I'm not completely
satisfied with the naming, but I now see that the context of the css module
makes it clearer what the semantics are. Still, I prefer "css_to_xpath()", and
providing a top-level class XPath() makes me think it should return an
etree.XPath object, i.e. a compiled path.

One more note:

  def run_xpath(doc, xpath):
      return [el for el in doc.xpath(xpath)
              if isinstance(el, etree.ElementBase)]

Do you mean "etree.iselement(el)" here or are you intentionally restricting
this to real-element subclasses of _Element? (i.e. no plain lxml.etree
elements, no PIs, no comments)
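
To illustrate the difference between the two checks, here is a small sketch
using a plain lxml.etree document (not lxml.html). The last filter, which
tests whether the tag is a string, is an additional suggestion of how to keep
real elements only; it is not taken from the code under discussion.

```python
from lxml import etree

doc = etree.fromstring("<root><!-- note --><?target data?><a>text</a></root>")

# node() also returns comments, processing instructions and text nodes
nodes = doc.xpath("//node()")

# iselement() accepts anything derived from _Element, including comments and PIs
by_iselement = [n for n in nodes if etree.iselement(n)]

# ElementBase only matches custom element classes, so plain etree elements
# from the default parser are filtered out as well
by_base = [n for n in nodes if isinstance(n, etree.ElementBase)]

# Comments and PIs have a factory function as their tag, real elements a string
real_elements = [n for n in by_iselement if isinstance(n.tag, str)]
```

With the default parser, the ElementBase filter drops everything here, while
iselement() lets the comment and the PI through along with the two elements.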

I actually think this module merits its own top-level place, not necessarily
only as part of lxml.html. It could just as well become "lxml.css", and should
thus not rely too much on a specific API from lxml.html.

Stefan



