[lxml-dev] xpath on text nodes

Stefan Behnel stefan_ml at behnel.de
Tue May 12 20:22:12 CEST 2009


Hi,

Jamie Norrish wrote:
> I've included at the end of this message an example of the XML I'm
> operating over, where the aim is to get a rough number of characters of
> textual content preceding and following a name or rs element. Given the
> highly multiform nature of the markup, I *think* that the simplest way
> of going about this is to go from text node to text node, forward and
> back, accumulating the text as it goes, and stopping once a certain
> amount has been reached.
> 
> The way I'm currently doing this is by simply selecting a certain number
> of text nodes preceding and following the name or rs element
> (name_node.xpath('following::text()[position()<15]'), for example), and
> iterating through those and stopping when the right amount of text has
> been accumulated. Obviously this has the problem that too many or too
> few text nodes (in the XPath result sense) may be selected, which is
> either inefficient or leads to too little context.
>
> Selecting an ancestor and then splitting the textual content of that
> isn't, I think, a better option, given the nature of the XML I'm dealing
> with. A name/rs element may be at almost any level of the tree, and its
> textual content may well be repeated multiple times within any given
> chunk.

Ok, I now see where you are coming from. Something like the above XPath
expression or the respective lxml.etree API code would have been my first
attempt, too. I actually doubt that you can do much better in this case.
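
For reference, the fixed-window approach from the quoted message can be
sketched roughly as follows. Only the XPath expression comes from the
original; the helper name xpath_following_context(), the character limit
and the sample markup are assumptions made up for illustration.

```python
from lxml import etree


def xpath_following_context(el, max_nodes=15, limit=40):
    """Collect up to `limit` characters of text following `el`."""
    # Same expression as in the quoted message; $n is an XPath variable.
    nodes = el.xpath("following::text()[position() < $n]", n=max_nodes)
    parts, total = [], 0
    for text in nodes:
        parts.append(text)
        total += len(text)
        if total >= limit:
            break
    return "".join(parts)[:limit]


# Hypothetical sample, loosely modelled on the letter quoted below.
doc = etree.fromstring(
    "<p>give my love to <name>Peter</name>, hoping he is "
    "<lb/>finding his way</p>"
)
name = doc.find("name")
print(xpath_following_context(name, limit=10))
# A window that is too small silently yields too little context:
print(xpath_following_context(name, max_nodes=2, limit=1000))
```

The second call shows the weakness described above: the number of text
nodes has to be guessed up front, independently of how long they are.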

It's actually a more general problem. Imagine you select a text node that
has a certain length and contains the found text multiple times. How would
you find a good context here? Is it the context of the first occurrence,
which may include a lot of preceding text but not the last occurrence
within the text node itself (if it is long enough) - or is it the last
occurrence that is interesting here, with all the text that follows the
matching text node?

So the underlying problem is independent of the API you use. It's rather
that substrings do not align nicely with the granularity of text nodes.


> I totally understand that it's problematic to change lxml to have a
> different model for text, and I'm either going to continue with my
> current method, or else use a modified form of my ideal solution, which
> is to get the parent element of the text, and then use XPath again to
> get the appropriate next text node in the sequence from that. This is a
> little more cumbersome than I'd like, obviously, since the expression
> changes not just by the direction of the context (preceding or
> following) but also whether the current text is the text or tail of the
> element. I'd have to run some tests to see whether the extra processing
> slowed things down too much - this process is one that operates over
> (often) thousands of name elements within each of over a thousand
> documents.

Maybe you should try the same thing without XPath, just using the API.
XPath is fast when you are very selective or when you grab the aggregated
text content of an element. It's less suited to stepping through the tree
iteratively. The API-based algorithm may not even be that complex, as you
can use tree iteration and sibling/parent navigation. (Did I mention that
readability counts? :)
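
A minimal sketch of what such an API-based walk might look like, using
only the plain element API (text, tail, itersiblings(), getparent()).
The names subtree_text(), following_text() and take() and the sample
markup are invented here; this is one possible shape, not a definitive
implementation.

```python
from lxml import etree


def subtree_text(el):
    """Yield the text inside `el` (not its tail) in document order."""
    if not isinstance(el.tag, str):  # comment or processing instruction
        return
    if el.text:
        yield el.text
    for child in el:
        yield from subtree_text(child)
        if child.tail:
            yield child.tail


def following_text(el):
    """Lazily yield the text chunks that follow `el` in document order."""
    node = el
    while node is not None:
        if node.tail:
            yield node.tail
        for sib in node.itersiblings():
            yield from subtree_text(sib)
            if sib.tail:
                yield sib.tail
        node = node.getparent()  # climb up and continue after the parent


def take(chunks, limit):
    """Consume chunks only until `limit` characters have been gathered."""
    parts, total = [], 0
    for chunk in chunks:
        parts.append(chunk)
        total += len(chunk)
        if total >= limit:
            break
    return "".join(parts)[:limit]


# Hypothetical sample markup.
doc = etree.fromstring(
    "<body><p>love to <name>Peter</name>, hoping <lb/>he is well.</p>"
    "<closer><salute>Your son</salute></closer></body>"
)
name = doc.find(".//name")
print(take(following_text(name), 12))
```

Because following_text() is a generator, no more of the tree is visited
than is needed to reach the limit, so there is no node count to guess.
The preceding direction should work the same way in mirror image, for
instance with itersiblings(preceding=True) and the chunks of each
subtree taken in reverse.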


> (The point of getting this context is to give people some idea of who a
> name element might be referring to, for when it is being keyed to an
> entity in our authority control system. So the markup doesn't matter
> particularly, but the textual content does.)
> 
> Stefan Behnel wrote:
>> I still do not have a clear idea of what you consider "text context"
>> actually. Does that take the tree structure into account (e.g. only within
>> a certain parent element), or is it just any text content that precedes the
>> XPath result in reverse document order, wherever it occurs in the tree?
> 
> Just any, though there are some cases where the markup could be used to
> usefully limit the context. For example, the name may occur within a
> bibliographic entry in a list of citations, and it's unlikely that any
> textual content from before or after that entry will be relevant. That's
> typically going to be the exception, however; even staying within a
> paragraph element is not necessarily helpful (named things are often
> introduced at the end of a paragraph and given more context in the
> following paragraph, for example).

This sounds like your algorithm is already more complex than a simple "any
text node preceding the one that matches". That convinces me that an
API-based solution will be a lot more flexible than anything you could
squeeze out of XPath. It would allow you to special-case certain tag
types, for example, or to notice when you cross parent boundaries.
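
As a rough illustration of both ideas, the sketch below skips some tags
entirely and stops when the walk would leave a boundary element. The tag
sets (abbr, orig, del, bibl), the function names and the sample markup
are assumptions for the example, and tag names are taken to be
unnamespaced, which may not hold for the real documents.

```python
from lxml import etree

SKIP = frozenset({"abbr", "orig", "del"})  # hypothetical: text to ignore
BOUNDARY = frozenset({"bibl"})             # hypothetical: never leave these


def subtree_text(el, skip):
    """Yield text inside `el`, omitting subtrees whose tag is in `skip`."""
    if not isinstance(el.tag, str) or el.tag in skip:
        return  # comment, processing instruction, or skipped tag
    if el.text:
        yield el.text
    for child in el:
        yield from subtree_text(child, skip)
        if child.tail:
            yield child.tail


def context_after(el, skip=SKIP, boundary=BOUNDARY):
    """Yield following text, stopping at the enclosing boundary element."""
    node = el
    while node is not None:
        if node.tail:
            yield node.tail
        for sib in node.itersiblings():
            yield from subtree_text(sib, skip)
            if sib.tail:
                yield sib.tail
        node = node.getparent()
        if node is not None and node.tag in boundary:
            return  # crossing this parent would leave the entry


# Hypothetical sample: a name inside one entry of a citation list.
doc = etree.fromstring(
    "<list><bibl><name>Beaglehole</name>, <choice><abbr>Yr</abbr>"
    "<expan>Your</expan></choice> letters</bibl>"
    "<bibl>Another entry</bibl></list>"
)
name = doc.find(".//name")
print("".join(context_after(name)))
print("".join(context_after(name, skip=frozenset(), boundary=frozenset())))
```

With empty tag sets the walk degrades to plain "any following text",
which runs the abbreviation and its expansion together and spills into
the next entry.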


> Here's the example of a small piece of a document, in case it helps.

I'll leave it in the reply, just in case others have ideas, too.


> But really, I'm happy enough with the way lxml works (it's great software -
> thank you and everyone else who has made it what it is!). Not being
> familiar with its inner workings I didn't know whether it would be
> feasible or practical to add XPath to text results. Now I know, and I'll
> continue on without complaint.

:)

Stefan


>           <lb/>give my love to everybody including <name
> key="name-110011" type="person">Peter</name>, hoping he is
>           <lb/>finding his way around the house better now, &amp; that
> this
> 	  <lb/>
> 	  <pb xml:id="n12" n="12" corresp="#JCB-001l"/>
> 	  finds you as it leaves me, in the best of health &amp; very
> 	  <lb/>much in love with you.
> 	</p>
>         <closer>
>           <salute><choice><abbr>Yr</abbr><expan>Your</expan></choice>
> <choice><abbr>affect.</abbr><expan>affectionate</expan></choice> son
> 	  </salute>
>           <lb/>
>           <signed>
>             <name key="name-207379" type="person">J.C. Ulysses
> Beaglehole</name>
>           </signed>
>           <seg type="postscript">P.S. You might tell yourself, <name
> key="name-110417" type="person">Auntie</name> &amp; <name
> key="name-034628" type="person">Christine</name>, that
> 	    <lb/>I have struck nobody yet with so swish a
> 
> <choice><orig>dressing-<lb/>gown</orig><reg>dressing-gown</reg></choice>
> as mine.
> 	    <lb/>I had now better get on to some other letters
> 	    <lb/>of thanks, greeting, business, etc.</seg>
>           <signed><name key="name-207379"
> type="person">J.</name></signed>
>           <seg type="postscript">P.P.S. You might send me the date of
> Auntie <unclear>Sis'</unclear>
> 	    <lb/>birthday. I hope Auntie's had a fitting celebration.</seg>
>           <signed><name key="name-207379" type="person">J.</name>
> 	    <lb/>
> 	  </signed>
>           <seg type="postscript">P.P.P.S. I have been writing all the
> morning &amp; it is now
> 	    <lb/>¼ to 1. If you pass the letter round it will save
> 	    <lb/>much exhaustion to my dexter hand.</seg>
>           <salute>
> 	    <choice><abbr>Yrs</abbr><expan>Yours</expan></choice>
> 	    <del>finally</del> penultimately,
> 	    <lb/>
> 	  </salute>
>           <signed><name key="name-207379"
> type="person">J.C.B.</name></signed>


