[lxml-dev] .text_content() should leave spaces. Tests included
Stefan Behnel
stefan_ml at behnel.de
Sat Aug 23 09:32:50 CEST 2008
Hi,
Max Ivanov wrote:
> I've run into another strange behaviour.
I wouldn't call that "strange behaviour". What you want is a new feature.
> lxml.html.HTMLParser
> produces HTML elements with a similar API to etree elements, but with
> some additions. One of them is the .text_content() method. A quote from
> the docs: "Returns the text content of the element, including the text
> content of its children, with no markup."
>
> So according to the description it transforms
> "<span>element1</span><span>element2</span>" to "element1element2".
> Notice the lack of space between contents of two elements.
Exactly as in the HTML source, I would say. Given your specific example, I
don't think a browser would display it any differently.
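
To make the point concrete, here is a minimal sketch of the current behaviour. The wrapping <div> and the variable names are assumptions added for illustration; only lxml.html.fromstring() and .text_content() are used.

```python
from lxml import html

# Adjacent spans with no whitespace between them in the source.
tight = html.fromstring(
    "<div><span>element1</span><span>element2</span></div>")
# Same markup, but the source itself carries a space inside the first span.
spaced = html.fromstring(
    "<div><span>element1 </span><span>element2</span></div>")

tight_text = tight.text_content()
spaced_text = spaced.text_content()

print(repr(tight_text))   # 'element1element2'
print(repr(spaced_text))  # 'element1 element2'
```

Whitespace that is present in the source text survives; none is invented between elements.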
> From my
> point of view, that makes this method quite useless. It would be
> better if it produced "element1 element2" from the same string. Here is a
> test for test_htmlparser.py:
>
> def test_html_text_content(self):
>     from lxml.html import HTMLParser
>     element = self.etree.HTML(self.html_str, parser=HTMLParser())
>     self.assertEqual(element.text_content(), "test page title")
That would be wrong, as it alters the content while collecting it. I agree
that a few additional features could help target new use cases. For
example, the method could be smart about <br> tags and replace them with "\n".
But that would be optional behaviour enabled by a keyword argument.
Feel free to provide a patch.
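
As a rough sketch of what such an opt-in feature could look like, here is a standalone helper. The name text_with_breaks and the br_newline keyword are hypothetical and not part of lxml; it only assumes the usual .tag/.text/.tail element API.

```python
from lxml import html

def text_with_breaks(element, br_newline=False):
    # Hypothetical helper, not an lxml API: collects text in document
    # order like text_content(), optionally emitting "\n" for each <br>.
    parts = []

    def walk(node):
        # Comments and processing instructions have a non-string tag;
        # their content is skipped, as text_content() does.
        if isinstance(node.tag, str):
            if br_newline and node.tag == "br":
                parts.append("\n")
            if node.text:
                parts.append(node.text)
            for child in node:
                walk(child)
        # A tail follows the node's closing tag. The root's tail lies
        # outside the element, so it is left out.
        if node is not element and node.tail:
            parts.append(node.tail)

    walk(element)
    return "".join(parts)

p = html.fromstring("<p>line one<br>line two</p>")
default_text = text_with_breaks(p)
broken_text = text_with_breaks(p, br_newline=True)
print(repr(default_text))  # 'line oneline two'
print(repr(broken_text))   # 'line one\nline two'
```

With the keyword left at its default, the result matches .text_content(), so existing callers would see no change.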
Stefan