[lxml-dev] Text obscured by subelement
John J Lee
jjl at pobox.com
Mon Aug 25 00:03:13 CEST 2008
On Sun, 24 Aug 2008, Richard Baron Penman wrote:
>
> I have a document with a format like this:
> <doc>text1<b>text2</b>text3<b>text4</b>text5</doc>
>
> I want to extract 'text1text3text5' from <doc> but the text attribute
> returns just 'text1'. Here is an example:
>
> from lxml import html
> doc = html.fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')
[...]
> From the example you can see I can get what I want by first dropping the
> subelements.
> Is there a better way to access this text?
[...]
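I only have 1.3.6 installed, so I don't have the HTML support, but I think
you want the .tail of the b elements: the text that follows an element's
closing tag is stored on that element's .tail, not on the parent's .text.
With the XML API: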
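from lxml.etree import fromstring
doc = fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')
# list(doc) gives the child elements; getchildren() is deprecated in later lxml
b1, b2 = list(doc)
# doc.text is 'text1'; each <b>'s .tail is the text after its closing tag
print(doc.text + b1.tail + b2.tail)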
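If the number of child elements isn't known in advance, the same idea can be
generalised by looping over the children. This is only a minimal sketch:
own_text is a made-up helper name, not part of lxml, and the fallback import
assumes the standard-library ElementTree is an acceptable stand-in when lxml
isn't installed (it exposes the same .text and .tail attributes).

```python
try:
    from lxml.etree import fromstring
except ImportError:
    # Assumption: fall back to the stdlib, which has the same .text/.tail API.
    from xml.etree.ElementTree import fromstring


def own_text(element):
    """Join an element's direct text, skipping text inside its children.

    Hypothetical helper for illustration; not an lxml function.
    """
    # .text is the text before the first child; each child's .tail is the
    # text that follows that child's closing tag. Either may be None.
    parts = [element.text] + [child.tail for child in element]
    return ''.join(part for part in parts if part)


doc = fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')
print(own_text(doc))
```

For the example document this prints text1text3text5, and the None check
means an element with no text or tail of its own just contributes nothing.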
John