[lxml-dev] .text_content() should leave spaces. Tests included
Mike Meyer
mwm-keyword-lxml.9112b8 at mired.org
Sat Aug 23 10:34:36 CEST 2008
On Sat, 23 Aug 2008 11:57:19 +0400
"Max Ivanov" <ivanov.maxim at gmail.com> wrote:
> >> So according to description it transforms
> >> "<span>element1</span><span>element2</span>" to "element1element2".
> >> Notice the lack of space between contents of two elements.
> >
> > Exactly as in the HTML source, I would say. Given your specific example, I
> > don't think a browser would display it any different.
> >
> Maybe <span> examples are not suitable here, but .text_content() on
> "<html><head><title>test</title></head><body><h1>page
> title</h1></body></html>" displaying "testpage title" instead of "test
> page title" is definitely wrong. Imagine what would happen with
> <table> with multiple td's and tr's - it'll transform it to one big
> word without spaces. Do you think that is correct? The easiest way
> would be to put spaces between the content of any two tags and keep all
> other symbols between tags.
Easiest way to what? Fix this broken behavior? But it'll break the
correct behavior where inline tags are used to change the rendering of
elements in a word (like <span color="blue">bl</span><span
color="green">een</span>).
If you want it to look like what a browser might render, you want to
put spaces between block elements but not inline elements. Of course,
whether a particular tag is inline or not can be changed by whatever
style sheets are in use. And title - well, its contents aren't
rendered in the body of the page at all. So maybe they should just
vanish?
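
To make the block-versus-inline idea concrete, here is a rough sketch. It
uses Python's standard-library html.parser rather than lxml, so it says
nothing about how .text_content() is or should be implemented. The names
BLOCK_TAGS, SKIP_TAGS and block_aware_text are made up for illustration,
and the tag sets are assumed defaults that any style sheet could override.

```python
from html.parser import HTMLParser

# Assumption: a small, incomplete set of tags that browsers render as
# block-level (or as separate cells) by default. CSS can change this.
BLOCK_TAGS = {
    "p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
    "ul", "ol", "li", "table", "tr", "td", "th", "br",
}

# Assumption: tags whose text never appears in the page body.
SKIP_TAGS = {"title", "script", "style"}


class BlockAwareText(HTMLParser):
    """Collect text, adding a space at block boundaries only."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        # Inline tags add no separator, so adjacent inline text stays joined.
        if self.skip_depth == 0:
            self.parts.append(data)


def block_aware_text(html):
    parser = BlockAwareText()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace, roughly as a browser would.
    return " ".join("".join(parser.parts).split())
```

With these assumptions the two-span word stays one word, table cells come
out separated by spaces, and the title text vanishes. Whether dropping
the title is the right call is exactly the open question above.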
<mike
--
Mike Meyer <mwm at mired.org> http://www.mired.org/consulting.html
Independent Network/Unix/Perforce consultant, email for more information.
O< ascii ribbon campaign - stop html mail - www.asciiribbon.org