[lxml-dev] Text obscured by subelement

Piet van Oostrum piet at cs.uu.nl
Mon Aug 25 00:16:56 CEST 2008


>>>>> John J Lee <jjl at pobox.com> (JJL) wrote:

>JJL> On Sun, 24 Aug 2008, Richard Baron Penman wrote:
>>> 
>>> I have a document with a format like this:
>>> <doc>text1<b>text2</b>text3<b>text4</b>text5</doc>
>>> 
>>> I want to extract 'text1text3text5' from <doc> but the text attribute
>>> returns just 'text1'. Here is an example:
>>> 
>>> from lxml import html
>>> doc = html.fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')
>JJL> [...]
>>> From the example you can see I can get what I want by first dropping the
>>> subelements.
>>> Is there a better way to access this text?
>JJL> [...]

>JJL> I only have 1.3.6 installed, so don't have the HTML support, but you want 
>JJL> to use the .tail of the b elements I think.  With the XML API:

>JJL> from lxml.etree import fromstring
>JJL> doc = fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')
>JJL> b1, b2 = doc.getchildren()
>JJL> print doc.text + b1.tail + b2.tail

# .text and .tail are None when there is no text, so fall back to ''
print (doc.text or '') + ''.join(c.tail or '' for c in doc.getchildren())
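
A slightly more defensive sketch of the same idea, wrapped in a function. The name own_text is just an illustration, not part of lxml, and the fallback import assumes the standard library's ElementTree uses the same .text/.tail model when lxml is not installed:

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: the stdlib ElementTree offers the same .text/.tail model
    import xml.etree.ElementTree as etree


def own_text(element):
    """Return the text directly inside element, skipping text inside its children.

    own_text is a hypothetical helper name, not an lxml function.
    """
    # .text is the text before the first child (None if there is none)
    parts = [element.text or '']
    # Iterating an element yields its direct children; each child's .tail
    # is the text between that child's end tag and the next sibling
    parts.extend(child.tail or '' for child in element)
    return ''.join(parts)


doc = etree.fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')
print(own_text(doc))

# No text before or after the child: .text and .tail are both None here
empty = etree.fromstring('<doc><b>text2</b></doc>')
print(repr(own_text(empty)))
```

Iterating over the element directly does the same job as getchildren() here, and the "or ''" guards keep the join from failing when a piece of text is missing.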
-- 
Piet van Oostrum <piet at cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: piet at vanoostrum.org


