[lxml-dev] Some XPath questions...
Stefan Behnel
stefan_ml at behnel.de
Mon Jul 2 10:32:36 CEST 2007
Hi Ian,
just to comment on your actual first post in this thread, which I kinda
overlooked because of the later discussion.
I think this is pretty cool stuff and I'd love to have this in lxml. The html
module really seems to be getting somewhere. I think we shouldn't even wait
too long with a release so that we get some more feedback on the new APIs.
Maybe I should fix lxml's versioning so that we can put out a 2.0alpha1 (and
not only alpha, beta, final).
Ian Bicking wrote:
> div:contains('celia') -- means a div where the textual content has the
> word 'celia' in it, case insensitive. At least, I think it's case
> insensitive -- the CSS spec is annoyingly vague, but implementations
> seem to work like this. I translate this to:
>
> descendant-or-self::div[contains(css:lower-case(string(.)), 'celia')]
>
> I added the lower-case function like:
>
>     def _make_lower_case(context, s):
>         return s.lower()
>     etree.FunctionNamespace("css")['lower-case'] = _make_lower_case
"css" is not the namespace, it's the prefix. You can do this:
    ns = etree.FunctionNamespace("http://my/css/namespace")
    ns.prefix = "css"
    ns['lower-case'] = _make_lower_case
or this:
    ns = etree.FunctionNamespace("http://my/css/namespace")
    ns['lower-case'] = _make_lower_case

    def css_to_xpath(css):
        xpath = build_xpath(css)
        # map the "css" prefix to the function namespace for this expression
        return etree.XPath(xpath, namespaces={'css': "http://my/css/namespace"})
You should consider providing a default namespace map here, and maybe even
return compiled XPath objects, i.e. callables. Note that these provide a
"path" attribute that returns the original path, so if you have to extend an
expression later on, you can still do so by creating a new XPath object.
Note that this would also allow you to wrap the function with an additional
call to set(), so that or-ed results really become the union and not the sum
of all parts.
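
To make the last two points concrete, here is a minimal sketch, not a proposal for the final API. The helper name compile_union() and the sample document are made up for illustration, and the namespace URI is the same placeholder as above. It compiles each path once, keeps the original text reachable through the "path" attribute, and removes duplicates while keeping first-seen order (a plain set() would lose the ordering).

```python
from lxml import etree

CSS_NS = "http://my/css/namespace"  # placeholder URI, as in the snippets above


def _make_lower_case(context, s):
    # string(.) is passed in as a plain string
    return s.lower()


ns = etree.FunctionNamespace(CSS_NS)
ns["lower-case"] = _make_lower_case


def compile_union(*paths):
    """Hypothetical helper: compile each path once and return a callable
    that yields every matching element exactly once."""
    compiled = [etree.XPath(p, namespaces={"css": CSS_NS}) for p in paths]

    def find(doc):
        seen = set()
        unique = []
        for xp in compiled:
            for el in xp(doc):
                if el not in seen:
                    seen.add(el)
                    unique.append(el)
        return unique

    # keep the original expressions around for later extension
    find.paths = [xp.path for xp in compiled]
    return find


doc = etree.fromstring(
    "<body><div>Celia</div><div>Bob</div><p>celia</p></body>")
find = compile_union(
    "descendant-or-self::div[contains(css:lower-case(string(.)), 'celia')]",
    "descendant-or-self::div",
)
result = find(doc)
print([el.text for el in result])
```

The first div matches both expressions but shows up only once in the result.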
> But XPath gives so few errors that it's hard to tell if it's really
> working.
Sadly, there doesn't seem to be a simple way to find out that a function was
undeclared. Or maybe I'll just have to look back into that... didn't I do that
already? :)
> There's also
> div:nth-child(matcher) and div:nth-of-type(matcher), which selects among
> siblings with patterns like "2" (second sibling), "3n" (every third
> element), "odd" (odd elements) and some other selections. I kind of see
> how to deal with this using position(), but I'm not sure how to do
> either nth-of-type or nth-child (and the ones I do understand I am also
> vague about).
If I understand this correctly, this would be
    nth-of-type: //*/NAME[position() = x]
    nth-child:   //*/*[position() = x]
To deal with things like "2n", try this:
    //*/NAME[(position() mod 2) = 0]
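
A quick way to check these expressions is to run them against a small tree. The following is only a sketch on a made-up document, with "li" standing in for NAME:

```python
from lxml import etree

# made-up sample: four <li> siblings with a <p> mixed in
doc = etree.fromstring(
    "<ul><li>1</li><p>x</p><li>2</li><li>3</li><li>4</li></ul>")

# position() counts only among the <li> siblings selected by the step
second_of_type = doc.xpath("//*/li[position() = 2]")

# "2n": every second <li>
even_of_type = doc.xpath("//*/li[(position() mod 2) = 0]")

# position() counts among all element children here
second_child = doc.xpath("//*/*[position() = 2]")

print([el.text for el in second_of_type])
print([el.text for el in even_of_type])
print([el.tag for el in second_child])
```

Note that the last expression returns the second child whatever its tag is, so restricting it to a given element name would still need an extra test on the name.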
> I've committed the incomplete code in lxml.html.css
I skimmed through it a bit and found it really cool. I'm not completely
satisfied with the naming, but I now see that the context of the css module
makes it clearer what the semantics are. Still, I prefer "css_to_xpath()", and
providing a top-level class XPath() makes me think it should return an
etree.XPath object, i.e. a compiled path.
One more note:
    def run_xpath(doc, xpath):
        return [el for el in doc.xpath(xpath)
                if isinstance(el, etree.ElementBase)]
Do you mean "etree.iselement(el)" here or are you intentionally restricting
this to real-element subclasses of _Element? (i.e. no plain lxml.etree
elements, no PIs, no comments)
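
To illustrate the difference, here is a small sketch on a made-up document parsed with plain lxml.etree (so no custom element classes are involved):

```python
from lxml import etree

root = etree.fromstring("<root><!-- note --><a/></root>")
nodes = root.xpath("//* | //comment()")

# isinstance() against ElementBase only accepts custom element subclasses
as_element_base = [el for el in nodes if isinstance(el, etree.ElementBase)]

# iselement() accepts anything element-like, comments included
as_iselement = [el for el in nodes if etree.iselement(el)]

print(len(nodes), len(as_element_base), len(as_iselement))
```

With a plain etree document the isinstance() filter drops every result, while iselement() keeps the comment as well as the elements.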
I actually think this module merits its own top-level place, not necessarily
only as part of lxml.html. It could just as well become "lxml.css", and should
thus not rely too much on a specific API from lxml.html.
Stefan