[lxml-dev] .text_content() should leave spaces. Tests included

Stefan Behnel stefan_ml at behnel.de
Sat Aug 23 09:32:50 CEST 2008


Hi,

Max Ivanov wrote:
> I've run into another strange behaviour.

I wouldn't call that "strange behaviour". What you want is a new feature.


> lxml.html.HTMLParser
> produces HTML elements with a similar API to etree elements, but with
> some additions. One of them is the .text_content() method. To quote the
> docs: "Returns the text content of the element, including the text
> content of its children, with no markup."
> 
> So according to the description, it transforms
> "<span>element1</span><span>element2</span>" to "element1element2".
> Notice the lack of a space between the contents of the two elements.

Exactly as in the HTML source, I would say. Given your specific example, I
don't think a browser would display it any differently.
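
For illustration, a minimal sketch of the behaviour under discussion, assuming
lxml is installed. The wrapping <div> and the variable names are not from the
original message; they are only there to give the two spans a common parent.

```python
from lxml import html

# Two adjacent inline elements with no whitespace between them in the source.
fragment = html.fromstring(
    "<div><span>element1</span><span>element2</span></div>"
)

# text_content() concatenates the text nodes exactly as they appear,
# so no separator is introduced between the two spans.
combined = fragment.text_content()
print(combined)
```

The result contains no space because the source contains none between the
closing and opening span tags.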


> From my
> point of view, that makes this method quite useless; it would be
> better if it produced "element1 element2" from the same string. Here is a
> test for test_htmlparser.py:
> 
>      def test_html_text_content(self):
>          from lxml.html import HTMLParser
>          element = self.etree.HTML(self.html_str, parser=HTMLParser())
>          self.assertEqual(element.text_content(), "test page title")

That would be wrong, as it alters the content while collecting it. I agree
that a few additional features could help target new use cases. For
example, the method could be smart about <br> tags and replace them with "\n".
But that would be optional behaviour enabled by a keyword argument.
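
As a rough sketch of what such opt-in behaviour might look like when written
as a standalone helper rather than as a keyword argument: the function name
text_with_breaks and its newline_tags parameter are hypothetical and not part
of lxml. The demo uses the standard library's ElementTree so it runs without
extra dependencies; the helper only relies on .tag, .text, .tail and child
iteration, which lxml elements also provide.

```python
import xml.etree.ElementTree as ET


def text_with_breaks(element, newline_tags=("br",)):
    """Collect text in document order, emitting "\n" for each <br>.

    Hypothetical helper, not an lxml API.
    """
    parts = []

    def walk(node):
        # Skip comments and processing instructions (non-string tags in lxml);
        # their tail text is still appended by the parent loop below.
        if not isinstance(node.tag, str):
            return
        if node.tag in newline_tags:
            parts.append("\n")
        if node.text:
            parts.append(node.text)
        for child in node:
            walk(child)
            if child.tail:
                parts.append(child.tail)

    walk(element)
    return "".join(parts)


# Well-formed markup, since ElementTree parses XML rather than HTML.
root = ET.fromstring("<p>line one<br/>line two<span>!</span></p>")
print(repr(text_with_breaks(root)))
```

Keeping this separate from text_content() leaves the default result
unchanged, so the content is only altered when the caller asks for it.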

Feel free to provide a patch.

Stefan

