[lxml-dev] .text_content() should leave spaces. Tests included
Stefan Behnel
stefan_ml at behnel.de
Tue Aug 26 18:11:15 CEST 2008
Hi,
Max Ivanov wrote:
> Here is the way I've implemented a text_content() analog. I have no idea about
> XPath, so maybe some of the checks could be implemented as an XPath
> processing instruction. That's just a scratch to show the idea, with no deep
> testing at all, but the results are OK for me.
>
> inlinetags = [ <tags list from
> http://htmlhelp.com/reference/html40/inline.html> ] #except <br>
Note that there's lxml.html.defs, which should contain what you want here.
Also, by moving the "br" test to the top in your code above, you can just
leave it in the inlinetags set.
> for el in doc.iter():
>     if el.text and (el.tag not in self.inlinetags):
>         el.text = ''.join((' ',el.text))
>     if el.tail and (el.tag not in self.inlinetags):
>         el.tail += ' '
>     if el.tag == 'br':
>         if el.tail and not el.tail.startswith('\n'):
>             el.tail = '\n'+el.tail
>         else:
>             el.tail = '\n'
>         el.drop_tag()
You're modifying the tree here, which is unacceptable for a function that
returns a (partial) string serialisation. Apart from that, this seems like a
workable solution to your problem.
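
For illustration only, here is a minimal sketch of how the same idea could be
written without touching the tree: the pieces are collected into a list and
joined at the end. It is not part of lxml. The name spaced_text_content() and
the short INLINE_TAGS set are hypothetical placeholders (a real version would
take its tag sets from lxml.html.defs), and only the ElementTree-compatible
part of the API is used, so the standard library parser serves for the demo.

```python
import xml.etree.ElementTree as ET

# Hypothetical, deliberately incomplete set of inline tags (assumption).
INLINE_TAGS = frozenset(["a", "b", "i", "em", "strong", "span", "code"])


def _collect(el, inline_tags, parts):
    """Append the text pieces found below ``el`` to ``parts``."""
    if not isinstance(el.tag, str):
        return  # comments and processing instructions carry no content
    if el.tag == "br":
        parts.append("\n")  # a line break becomes a newline
        return
    is_block = el.tag not in inline_tags
    if is_block:
        parts.append(" ")  # separate block content from what precedes it
    if el.text:
        parts.append(el.text)
    for child in el:
        _collect(child, inline_tags, parts)
        if child.tail:
            parts.append(child.tail)  # text that follows the child element
    if is_block:
        parts.append(" ")  # separate block content from what follows it


def spaced_text_content(root, inline_tags=INLINE_TAGS):
    """Return the text below ``root``; the tree itself is only read."""
    parts = []
    _collect(root, inline_tags, parts)
    return "".join(parts)


if __name__ == "__main__":
    doc = ET.fromstring("<div><p>one</p><p>two <b>bold</b>text</p>a<br/>b</div>")
    print(repr("".join(doc.itertext())))   # plain concatenation, words run together
    print(repr(spaced_text_content(doc)))  # block boundaries padded with spaces
```

The result may contain runs of spaces where blocks are nested; whether to
collapse them is left to the caller.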
Stefan