[lxml-dev] whitespace in lxml.html vs. lxml.html.soupparser

Ian Kallen spidaman at gmail.com
Mon Jan 5 16:44:04 CET 2009


We're using CSSSelector to pull out document fragments. I noticed that
the fragments from lxml.html.soupparser parses don't have extra
whitespace (which is desirable), but fragments from lxml.html have extra
whitespace cruft (carriage returns serialized as &#13;, plus
indentation). For example:

w/soupparser:

"""<div class="post"><a name="8720086857907265707"/>
<p/><div/>Josh Bancroft over at <a
href="http://www.tinyscreenfuls.com/">TinyScreenfuls</a> puts together
a great <a href="http://www.tinyscreenfuls.com/2008/01/site-statistics-i-care-about-as-a-blogger/">roundup
of stats</a> that matter to bloggers with Google Analytics screen
shots and meaningful context.  The comments are helpful
too.<br/><br/>Highly recommended.<br/><br/>Technorati Tags: <a
href="http://technorati.com/tag/stats" rel="tag">Stats</a>,<br/><a
href="http://technorati.com/tag/bloggers"
rel="tag">Bloggers</a>,<br/><a
href="http://technorati.com/tag/blogging" rel="tag">Blogging</a><div/>
</div>"""

w/o soupparser:

"""<div class="post"><a name="8720086857907265707"/>&#13;
    &#13;
    <p/><div/>Josh Bancroft over at <a
href="http://www.tinyscreenfuls.com/">TinyScreenfuls</a> puts together
a great <a href="http://www.tinyscreenfuls.com/2008/01/site-statistics-i-care-about-as-a-blogger/">roundup
of stats</a> that matter to bloggers with Google Analytics screen
shots and meaningful context.  The comments are helpful
too.<br/><br/>Highly recommended.<br/><br/>Technorati Tags: <a
href="http://technorati.com/tag/stats" rel="tag">Stats</a>,<br/><a
href="http://technorati.com/tag/bloggers"
rel="tag">Bloggers</a>,<br/><a
href="http://technorati.com/tag/blogging"
rel="tag">Blogging</a><div/>&#13;
    </div>&#13;
  &#13;
  &#13;
  &#13;
  &#13;
  &#13;
 &#13;
&#13;"""

Is there a way to get the same output from plain lxml.html as from
soupparser? I'd hate to resort to post-processing the parsed output
unnecessarily with regexps or the like.
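
For reference, the kind of post-processing I'd like to avoid would look
roughly like the sketch below. It is only an illustration, not something
lxml provides: the helper name drop_blank_whitespace and the shortened
sample markup are made up for this example, and it uses the standard
library's xml.etree.ElementTree so that it runs on its own. lxml elements
expose the same .text, .tail and .iter() interface, so I'd expect the
same loop to apply to them.

```python
import xml.etree.ElementTree as ET


def drop_blank_whitespace(root):
    """Clear text/tail strings that hold only whitespace (including "\r").

    Hypothetical helper for illustration; not part of lxml.
    """
    for el in root.iter():
        if el.text is not None and not el.text.strip():
            el.text = None
        if el.tail is not None and not el.tail.strip():
            el.tail = None
    return root


# Shortened stand-in for the fragment above; &#13; parses to "\r".
sample = (
    '<body><div class="post"><a name="x"/>&#13;\n'
    '    &#13;\n'
    '    <p/>Highly recommended.<br/>&#13;\n'
    '    </div>&#13;\n'
    '  &#13;\n'
    '</body>'
)

body = ET.fromstring(sample)
post = drop_blank_whitespace(body.find("div"))
cleaned = ET.tostring(post, encoding="unicode")
print(cleaned)
```

One caveat with this approach: it also removes whitespace-only text
between inline elements, so two adjacent links separated only by a space
would end up run together. That is part of why I'd rather have the parser
or serializer handle it.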

thanks,
-Ian

