[lxml-dev] problem\bug in xpath compare() with text in tail
Stefan Behnel
stefan_ml at behnel.de
Sun May 25 12:42:54 CEST 2008
Hi,
while XPath might be considered somewhat off-topic for ElementTree, I find
your question about text() and .tail very on-topic for lxml.
ElementTree does not expose the concept of a "text node" to Python space, so
having them appear in XPath is somewhat ugly. Also, note that the parser may
decide to split long text content or content that contains entities into
multiple text nodes, so "text()" is not even guaranteed to return a text node
that contains the complete ".text" value of a node. That makes it a somewhat
fragile concept in XPath.
If you want to test for .text and .tail reliably, it is easiest to do it in
Python space. Look at the "siblings" example I gave in my first reply.
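As a rough illustration of what such a Python-space test could look like (this is not the "siblings" example from the earlier reply, and the helper name below is made up for this sketch), you can walk the tree and check .text and .tail directly:

```python
from lxml import etree as et

def elements_with_text_or_tail(root, needle):
    """Return (tag, where) pairs for elements whose .text or .tail contains needle.

    Hypothetical helper for illustration only.
    """
    hits = []
    for el in root.iter():
        # Comments and processing instructions have a non-string .tag; skip them.
        if not isinstance(el.tag, str):
            continue
        if el.text and needle in el.text:
            hits.append((el.tag, "text"))
        if el.tail and needle in el.tail:
            hits.append((el.tag, "tail"))
    return hits

root = et.HTML("<html><body>inbody<h5>text</h5>tail</body></html>")
found_tail = elements_with_text_or_tail(root, "tail")
found_inbody = elements_with_text_or_tail(root, "inbody")
```

Since .text and .tail are complete Python strings here, this does not depend on how the parser happened to split the text nodes.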
Note also that most XPath string functions can work on node content, so for
example:
//*[contains(., 'ABC')]
succeeds for any node where 'ABC' exists in the concatenated string value of
the node and all of its descendants (but not in the .tail text of the node itself):
>>> from lxml import etree as et
>>> e = et.HTML("<html><body>inbody<h5>text</h5>tail</body></html>")
>>> e.xpath("//*[contains(., 'text')]")
[<Element html at b7789374>, <Element body at b77893c4>,
<Element h5 at b7789414>]
>>> e.xpath("//*[contains(., 'tail')]")
[<Element html at b7789464>, <Element body at b778939c>]
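To see why those elements match, it may help to look at the string value that "." stands for in each case. A small sketch, assuming the same document and lxml.etree imported as et:

```python
from lxml import etree as et

e = et.HTML("<html><body>inbody<h5>text</h5>tail</body></html>")

# string(.) is the concatenation of all text nodes below the element.
# The tail of <h5> lives inside <body>, so it counts for <body> and <html>,
# but not for <h5> itself.
values = {el.tag: el.xpath("string(.)") for el in e.iter()}
```

Here the values for html and body both come out as 'inbodytexttail', while the value for h5 is just 'text'.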
Matan Ninio wrote:
> Why does the behavior of "text()" change to exclude
> tail elements when moving from "//text()" to "//*[contains(text(),'ABC')]"?
> What does the "text()" function *actually* do?
"//text()" will get you /any/ text node in the tree, regardless of its
position. "text()" is a node test that succeeds for all text nodes.
"//*[contains(text(),'ABC')]" will get you the element that has a text node as
direct child that contains the string "ABC". However, apparently, this only
works for the first text node:
>>> e = et.HTML("<html><body>inbody<h5>text</h5>tail</body></html>")
>>> e.xpath("//*[contains(text(), 'tail')]")
[]
>>> e.xpath("//*[contains(text(), 'inbody')]")
[<Element body at b7789324>]
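This appears to be in line with the XPath 1.0 spec rather than a problem in
libxml2: when a node-set is passed to a string function such as contains(),
only the string value of its first node in document order is used. As you say: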
> I can see that if an element
> were to have more than one text value, the meaning of "contains(text()," may be
> unclear.
I would accept that as an explanation. :)
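If you do want to match on any text node child instead of only the first one, one option is to move the contains() test into a predicate on text() itself, so that each text node is tested separately. XPath 1.0 converts a node-set to a string by taking only its first node in document order, which fits the result above. A sketch, assuming a reasonably recent lxml where XPath string results carry getparent() and is_tail:

```python
from lxml import etree as et

e = et.HTML("<html><body>inbody<h5>text</h5>tail</body></html>")

# contains(text(), ...) only looks at the first text node child of each element.
first_only = [el.tag for el in e.xpath("//*[contains(text(), 'tail')]")]

# Testing every text node child separately finds 'tail' as a child of <body>.
any_text = [el.tag for el in e.xpath("//*[text()[contains(., 'tail')]]")]

# lxml's text results remember where they came from: for tail text,
# getparent() is the element that carries the tail, and is_tail is True.
tails = [(t.getparent().tag, t.is_tail)
         for t in e.xpath("//text()[contains(., 'tail')]")]
```

Note the difference in viewpoint: in XPath terms the text node 'tail' is a child of body, while in ElementTree terms it is the .tail of h5.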
Stefan