[lxml-dev] problem\bug in xpath compare() with text in tail

Stefan Behnel stefan_ml at behnel.de
Sat May 24 13:48:11 CEST 2008


Hi,

Matan Ninio wrote:
> This may be just my (limited) understanding of XPath and XML, but I'm getting
> a strange problem when I try to use XPath to search for specific strings in a
> file.  Specifically, when I use "//*[contains(text(),'needle')]" to look for
> elements with "needle" in their text, it only works when the string appears in
> the "text" part, but not when it's in the "tail" part.  So:
> 
> <prompt> e=etree.HTML("<html><body>inbody<h5>text</h5>tail</body></html>")
> 
> <prompt> e.xpath("//text()")
> ['inbody', 'text', 'tail']
> 
> <prompt> e.xpath("//*[contains(text(),'text')]//text()")
> ['text']
> 
>  ----  works fine, but
> 
> <prompt> e.xpath("//*[contains(text(),'tail')]//text()")
> []
> 
>  ----  does not.
> 
> Is it just that I need to use a different function/attribute for the tail
> (instead of text())?

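The tail text is not inside the element: in the XPath data model it is a text
node that follows the element as a sibling, so text() evaluated on <h5> never
sees it. It is a child of <body>, but contains() only looks at the first node
of the node-set it is given, which for <body> is 'inbody'. You can either
iterate over all nodes and check .tail yourself, or do this (untested) to
reduce the overhead on the Python side: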

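    for el in e.xpath("//*[contains(following-sibling::text(),'tail')]"):
        # The XPath can also match on a later sibling's text, so el.tail
        # may be None or lack the string; check it again in Python.
        if el.tail and 'tail' in el.tail:
            ...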

Do some testing to find out which is faster for your data.
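
As a starting point for that comparison, here is a minimal sketch of both
options on the example document from the question. The function names are
made up for illustration, and the XPath expression in the second one is a
stricter variant of the one above (it only looks at the node directly after
each element), not something taken from this thread. It assumes well-formed
input parsed with fromstring(); the fallback import is only there so the
first option can be tried without lxml installed.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: the stdlib ElementTree is an acceptable stand-in here,
    # since it exposes the same .text/.tail attributes (but not .xpath()).
    import xml.etree.ElementTree as etree


def elements_with_tail(root, needle):
    """Option 1: walk every element and test its .tail in Python."""
    return [el for el in root.iter()
            if el.tail is not None and needle in el.tail]


def elements_with_tail_xpath(root, needle):
    """Option 2 (lxml only): let XPath test the node right after each element."""
    expr = ("//*[following-sibling::node()[1]"
            "[self::text()][contains(., $needle)]]")
    return root.xpath(expr, needle=needle)


root = etree.fromstring("<html><body>inbody<h5>text</h5>tail</body></html>")
print([el.tag for el in elements_with_tail(root, "tail")])  # ['h5']
if hasattr(root, "xpath"):
    print([el.tag for el in elements_with_tail_xpath(root, "tail")])  # ['h5']
```

Both return the elements whose tail contains the string, so they can be
timed against each other (for example with the timeit module) on real data.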

Stefan

