[lxml-dev] Problem with xpath behaviour.
Stefan Behnel
stefan_ml at behnel.de
Tue Aug 14 08:37:35 CEST 2007
Bruno Deferrari wrote:
> Hi, I'm using lxml 1.3.3, libxml2 2.6.29 and python 2.5.1, and I'm
> getting a weird behaviour when using xpath, for example:
>
> >>> from lxml import etree
> >>> html = '<html><body>text1<p>text2</p>text3<b>text4</b>text5</body></html>'
> >>> doc = etree.HTML(html)
> >>> etree.tostring(doc.xpath('//p')[0])
> '<p>text2</p>text3'
>
> Shouldn't I be getting just '<p>text2</p>' ?
Take one step back. What "doc.xpath('//p')[0]" returns is an Element with the
tag "p", no children, the text "text2" and the tail "text3". In lxml, the text
that follows an element's closing tag is stored on that element as its tail,
so when you serialise the element you get exactly the string you see. If you
do not want that behaviour, consider using the XPath() class and wrapping it
with a function that copies the result element and strips off its tail. Or,
wrap tostring() with a function that ignores the tail of a single element
that is passed to it.
Alternatively, consider using the still-not-released-but-close lxml.html
module for HTML handling. It comes with loads of handy HTML tools and also
provides ways to deal with this 'issue'.
http://codespeak.net/svn/lxml/branch/html/
Stefan