[lxml-dev] Possible bug in DOM tree iteration?
Viksit Gaur
vik.list.nutch at gmail.com
Sun Jun 29 11:34:02 CEST 2008
Hi all,
I'm running some tests on a page's DOM tree by assigning each element a
unique identifier and then doing some analysis using this. I use code
similar to,
    root = bs.fromstring(txtcontent)
    self.pagetree = etree.iterwalk(root, events=("start",))
    cnt = 0  # running counter used as the unique id
    for event, element in self.pagetree:
        element.attrib['uid'] = str(cnt)
        cnt = cnt + 1
    # etc.
However, I notice that when iterating through the DOM, on text such as
the following:
--
<p> This is something here which <b> has some more text here </b> and
<b> repeats here again for this </b> statement and some more text here
that doesn't have any tags at all. </p>
--
The uid is assigned only to the P and the first B, but everything after
that is left untouched. So, the word "statement" is never assigned a uid.
Moreover, I'm not sure how to access the rest of the text under the
P tag. When iterating through the tree, shouldn't the other tags be
included too, and shouldn't the text of the P element contain ALL
the text in there, including the text inside the B tags?
If this is intended behavior, could someone point me to how I could
access all the other text under the P tag, as well as the B tags in
there?
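
For reference, here is a minimal sketch of how the ElementTree data model
that lxml follows lays out the sample above. It assumes the fragment is
parsed with the plain etree.fromstring() XML parser rather than the
bs.fromstring() call from the code above, so results from that parser may
differ. In this model an element's .text holds only the text before its
first child, and text that follows a child's closing tag is stored in
that child's .tail, so the word "statement" belongs to the second B's
tail rather than to any element of its own. The variable names are made
up for illustration.

```python
from lxml import etree

# The sample markup from the question, as one well-formed XML fragment.
snippet = (
    "<p> This is something here which <b> has some more text here </b> and "
    "<b> repeats here again for this </b> statement and some more text here "
    "that doesn't have any tags at all. </p>"
)

# Assumption: plain XML parser, not the bs.fromstring() parser in the post.
root = etree.fromstring(snippet)

# Same numbering loop as in the question.
cnt = 0
for event, element in etree.iterwalk(root, events=("start",)):
    element.set("uid", str(cnt))
    cnt += 1

# Collect (tag, uid) pairs in document order.
uids = [(el.tag, el.get("uid")) for el in root.iter()]

first_b, second_b = root.findall("b")

print(repr(root.text))      # text of P before the first B
print(repr(first_b.text))   # text inside the first B
print(repr(first_b.tail))   # text between the two B elements
print(repr(second_b.tail))  # text after the second B, including "statement"

# All text under P in document order, with the text inside the B tags.
full_text = "".join(root.itertext())
print(full_text)
```

With this parser, only three elements exist (one P and two B), so only
three uids can be assigned. The trailing text is reachable through .tail
or itertext(), not through a separate element.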
Cheers
Viksit