[lxml-dev] Possible bug in DOM tree iteration?

Viksit Gaur vik.list.nutch at gmail.com
Sun Jun 29 11:34:02 CEST 2008


Hi all,

I'm running some tests on a page's DOM tree by assigning each element a
unique identifier and then doing some analysis based on those
identifiers. I use code similar to the following:

from lxml import etree

root = bs.fromstring(txtcontent)
self.pagetree = etree.iterwalk(root, events=("start",))
cnt = 0  # running counter used as the unique id
for event, element in self.pagetree:
    element.attrib['uid'] = str(cnt)
    cnt += 1

etc.

However, I notice that when iterating through the DOM, on text such as 
the following:

--
<p> This is something here which <b> has some more text here </b> and 
<b> repeats here again for this </b> statement and some more text here 
that doesn't have any tags at all. </p>

--

The uid is assigned only to the P and the first B, but everything after
that is left untouched. So the word "statement" is never assigned an id.
Moreover, I'm not sure how to access the rest of the text under the
P tag. When iterating through the tree, shouldn't the other tags be
included too, and shouldn't the text of the P element contain ALL
the text in there, including the text inside the B tags?

If this is intended behavior, could someone point me to how I could
access all the other text under the P tag, as well as the B tags in
there?
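
To help frame the question, here is a minimal, self-contained sketch of the situation. It is not my actual code: it assumes plain etree.fromstring instead of the bs parser, uses root.iter() in place of iterwalk (both visit elements in document order), and falls back to the standard library's ElementTree if lxml is not installed. The names SNIPPET and full_text are made up for illustration. As far as I understand the ElementTree model, text that follows a closing tag is stored on that element's .tail rather than on the parent's .text, and itertext() collects all the text under an element.

```python
try:
    from lxml import etree  # the library under discussion
except ImportError:
    # Assumption: stdlib fallback, which shares the basic Element API.
    import xml.etree.ElementTree as etree

# The sample markup from this post (SNIPPET is a hypothetical name).
SNIPPET = (
    "<p> This is something here which <b> has some more text here </b> and "
    "<b> repeats here again for this </b> statement and some more text here "
    "that doesn't have any tags at all. </p>"
)

root = etree.fromstring(SNIPPET)

# Assign a uid to every element, visiting them in document order.
for cnt, element in enumerate(root.iter()):
    element.set("uid", str(cnt))

# Text before the first child is in .text; text after a closing tag
# is in that child's .tail, not in the parent's .text.
for element in root.iter():
    print(element.tag, element.get("uid"),
          repr(element.text), repr(element.tail))

# All the text under <p>, including text inside and after the <b> tags.
full_text = "".join(root.itertext())
print(full_text)
```

If this model is right, the word "statement" would be found on the second B element's tail rather than on the P element itself.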

Cheers
Viksit

