[lxml-dev] .text_content() should leave spaces. Tests included

Max Ivanov ivanov.maxim at gmail.com
Tue Aug 26 11:19:08 CEST 2008


Here is how I've implemented an analog of text_content(). I don't know
XPath, so maybe some of the checks could be expressed as XPath
expressions instead. This is just a scratch version to show the idea;
it hasn't been tested in depth, but the results are OK for me.

inlinetags = [ <tag list from
http://htmlhelp.com/reference/html40/inline.html> ]  # except <br>


     # snapshot the elements first: drop_tag() changes the tree as we walk it
     for el in list(doc.iter()):
         if el.tag not in inlinetags:
             # pad block-level text so neighbouring blocks don't run together
             if el.text:
                 el.text = ' ' + el.text
             if el.tail:
                 el.tail += ' '
         if el.tag == 'br':
             # turn <br> into a newline, keeping whatever text follows it
             if not (el.tail or '').startswith('\n'):
                 el.tail = '\n' + (el.tail or '')
             el.drop_tag()
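
For anyone who wants to try the idea, here is a minimal self-contained
sketch of the same loop wrapped in a function. The function name
spaced_text_content and the short INLINE_TAGS set are my own placeholders
(not part of lxml, and only a subset of the inline list linked above), and
skipping comment nodes is an extra guard I assumed would be wanted.

```python
import lxml.html

# Hypothetical, abbreviated set for illustration only; the post refers to
# the full HTML 4.0 inline element list (minus <br>).
INLINE_TAGS = {"a", "abbr", "b", "code", "em", "i", "span", "strong"}


def spaced_text_content(html):
    """Return text_content() of an HTML fragment with spaces kept
    between block-level elements and newlines in place of <br>."""
    doc = lxml.html.fromstring(html)
    # Snapshot the elements, because drop_tag() modifies the tree.
    for el in list(doc.iter()):
        if not isinstance(el.tag, str):
            continue  # comments and processing instructions have no tag name
        if el.tag not in INLINE_TAGS:
            if el.text:
                el.text = " " + el.text
            if el.tail:
                el.tail += " "
        if el.tag == "br":
            if not (el.tail or "").startswith("\n"):
                el.tail = "\n" + (el.tail or "")
            el.drop_tag()
    return doc.text_content()


if __name__ == "__main__":
    print(repr(spaced_text_content("<div><p>Hello</p><p>World</p></div>")))
    # ' Hello World'
```

Plain text_content() on the same input gives 'HelloWorld'. Note that the
padding is one-sided (a space before .text, a space after .tail), so the
result can carry leading or doubled spaces and may need normalising
afterwards.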

