[lxml-dev] .text_content() should leave spaces. Tests included

Max Ivanov ivanov.maxim at gmail.com
Sat Aug 23 09:05:14 CEST 2008


Hi! I've run into another strange behaviour. lxml.html.HTMLParser
produces HTML elements with an API similar to that of etree elements,
plus some additions. One of them is the .text_content() method. To quote
the docs: "Returns the text content of the element, including the text
content of its children, with no markup."

So, according to the description, it transforms
"<span>element1</span><span>element2</span>" into "element1element2".
Notice the lack of a space between the contents of the two elements.
From my point of view, that makes this method quite useless; it would be
better if it produced "element1 element2" from the same string. Here is
a test for test_htmlparser.py:

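     def test_html_text_content(self):
         from lxml.html import HTMLParser
         # Parse the shared fixture with the lxml.html parser so that
         # the resulting elements provide .text_content().
         element = self.etree.HTML(self.html_str, parser=HTMLParser())
         # Text from separate elements should be separated by a space.
         self.assertEqual(element.text_content(), "test page title")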
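
To make the difference concrete, here is a minimal standalone sketch of the behaviour described above, together with one possible workaround built on itertext(). The helper name text_content_with_spaces is hypothetical and not part of lxml; it only illustrates the output I would expect, and is not a proposed patch.

```python
from lxml import html


def text_content_with_spaces(element):
    """Hypothetical helper: join all text under `element` with single spaces."""
    # itertext() yields every text and tail string in document order.
    parts = (text.strip() for text in element.itertext())
    # Drop whitespace-only pieces so no doubled spaces appear.
    return " ".join(part for part in parts if part)


fragment = html.fromstring(
    "<div><span>element1</span><span>element2</span></div>"
)

# Current behaviour: the contents of the two spans are glued together.
joined = fragment.text_content()

# Workaround: text from different elements stays separated.
spaced = text_content_with_spaces(fragment)

print(repr(joined))
print(repr(spaced))
```

One caveat of this approach: it inserts a space at every element boundary, so inline markup inside a word (for example "<b>w</b>ord") would come out as "w ord".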

