[lxml-dev] Text obscured by subelement

John J Lee jjl at pobox.com
Mon Aug 25 00:03:13 CEST 2008


On Sun, 24 Aug 2008, Richard Baron Penman wrote:
>
> I have a document with a format like this:
> <doc>text1<b>text2</b>text3<b>text4</b>text5</doc>
>
> I want to extract 'text1text3text5' from <doc> but the text attribute
> returns just 'text1'. Here is an example:
>
> from lxml import html
> doc = html.fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')
[...]
> From the example you can see I can get what I want by first dropping the
> subelements.
> Is there a better way to access this text?
[...]

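I only have lxml 1.3.6 installed, so I don't have the HTML support, but I
think you want the .tail of the b elements: an element's .text holds only
the text before its first child, and the text that follows each child is
stored in that child's .tail.  With the XML API: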

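from lxml.etree import fromstring
doc = fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')
# iterating an element yields its children (getchildren() is deprecated)
b1, b2 = list(doc)
# doc.text is 'text1'; each b's .tail is the text right after that b
print(doc.text + b1.tail + b2.tail)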
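
If the number of subelements isn't known in advance, the same idea can be
written as a loop.  This is only a rough sketch: the helper name own_text
is made up for illustration and is not part of lxml, and the fallback to
the standard library's xml.etree.ElementTree (which uses the same
.text/.tail model) is there just so the snippet runs without lxml.  Both
.text and .tail can be None, hence the "or ''":

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: stdlib ElementTree is an acceptable stand-in for this sketch
    import xml.etree.ElementTree as etree


def own_text(element):
    """Join an element's direct text: its .text plus each child's .tail.

    Hypothetical helper, not an lxml function.
    """
    parts = [element.text or '']      # text before the first child
    for child in element:             # iterating yields the direct children
        parts.append(child.tail or '')  # text right after this child
    return ''.join(parts)


doc = etree.fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')
print(own_text(doc))
```

This leaves the tree untouched, so nothing has to be dropped first.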


John


