[lxml-dev] whitespace in lxml.html vs. lxml.html.soupparser

Stefan Behnel stefan_ml at behnel.de
Tue Jan 6 10:57:18 CET 2009


Hi,

Ian Kallen wrote:
> We're using CSSSelector to pull out document fragments. I noticed that
> the fragments from lxml.html.soupparser parses don't have extra
> whitespace (which is desirable) but fragments from lxml.html have extra
> whitespace cruft. For example
> 
> w/soupparser:
> 
> """<div class="post"><a name="8720086857907265707"/>
> <p/><div/>Josh Bancroft over at <a
> href="http://www.tinyscreenfuls.com/">TinyScreenfuls</a> puts together
> a great <a href="http://www.tinyscreenfuls.com/2008/01/site-statistics-i-care-about-as-a-blogger/">roundup
> of stats</a> that matter to bloggers with Google Analytics screen
> shots and meaningful context.  The comments are helpful
> too.<br/><br/>Highly recommended.<br/><br/>Technorati Tags: <a
> href="http://technorati.com/tag/stats" rel="tag">Stats</a>,<br/><a
> href="http://technorati.com/tag/bloggers"
> rel="tag">Bloggers</a>,<br/><a
> href="http://technorati.com/tag/blogging" rel="tag">Blogging</a><div/>
> </div>"""
> 
> w/o soupparser:
> 
> """<div class="post"><a name="8720086857907265707"/>&#13;
>     &#13;
>     <p/><div/>Josh Bancroft over at <a
> href="http://www.tinyscreenfuls.com/">TinyScreenfuls</a> puts together
> a great <a href="http://www.tinyscreenfuls.com/2008/01/site-statistics-i-care-about-as-a-blogger/">roundup
> of stats</a> that matter to bloggers with Google Analytics screen
> shots and meaningful context.  The comments are helpful
> too.<br/><br/>Highly recommended.<br/><br/>Technorati Tags: <a
> href="http://technorati.com/tag/stats" rel="tag">Stats</a>,<br/><a
> href="http://technorati.com/tag/bloggers"
> rel="tag">Bloggers</a>,<br/><a
> href="http://technorati.com/tag/blogging"
> rel="tag">Blogging</a><div/>&#13;
>     </div>&#13;
>   &#13;
>   &#13;
>   &#13;
>   &#13;
>   &#13;
>  &#13;
> &#13;"""
> 
> Is there a way to get the same output w/o soupparser as with?

The two modules use different parsers (providing a different parser is the
whole purpose of the soupparser module), and it looks like the BeautifulSoup
parser drops that whitespace in your example.


> I'd hate
> to resort to post-processing the parses unnecessarily with regexps or
> such.

No need for regexps here; /one/ problem is definitely enough.

In your example, the whitespace differences are confined to the Element
tails (the text that follows an element's end tag), so this should work:

	# collapse every whitespace-only tail to a single space
	for el in html_root.iter():
	    if el.tail and not el.tail.strip():
	        el.tail = ' '
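
For completeness, here is the same loop as a self-contained sketch you can
run. The helper name collapse_whitespace_tails and the sample markup are
made up for illustration (they are not part of lxml), and I'm assuming the
fragment is parsed with lxml.html.fromstring() and serialised with
etree.tostring(), which may differ from what you actually do:

```python
from lxml import etree
import lxml.html


def collapse_whitespace_tails(root):
    """Replace every whitespace-only tail below root with a single space.

    Hypothetical helper name; it just wraps the loop from this mail.
    """
    for el in root.iter():
        # el.tail is the text that follows el's end tag (None if absent)
        if el.tail and not el.tail.strip():
            el.tail = ' '
    return root


# Made-up sample with CRLF line endings and indentation between the tags
sample = (
    '<div class="post"><a name="x"></a>\r\n'
    '    \r\n'
    '    <p>some text</p>\r\n'
    '    </div>'
)

html_root = lxml.html.fromstring(sample)
collapse_whitespace_tails(html_root)
serialized = etree.tostring(html_root, encoding="unicode")
print(serialized)
```

Note that this only touches tails consisting entirely of whitespace. Tails
that contain real text are left alone, as is whitespace inside el.text, so
if you need those cleaned up as well you would have to handle them
separately.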

Stefan


