[lxml-dev] Some benchmarks

Stefan Behnel stefan_ml at behnel.de
Mon Mar 10 21:15:58 CET 2008


Hi,

Ian Bicking wrote:
> I get 25MB, which seems quite reasonable.  Here's the revised numbers:
> 
>                   VSZ    /  RSS
> lxml           :  25908  /  26232
> bs             :  82508  /  82168
> html5_cet      :  54616  /  54760
> html5_et       :  64688  /  64964
> html5_lxml     :  49056  /  49124
> html5_minidom  : 194352  / 192936
> html5_simple   :  99772  /  98016
> lxml_bs        : 104916  / 104856
> htmlparser     :   4440  /   4448

Still pretty good for lxml. That actually surprises me: cET is more
memory-friendly by itself (due to its simpler tree model), so it must be
html5lib that takes its bite here.


> I also tried allocating random strings until the size increased, to see
> if there was lots of allocated but free memory (the unused amount is an
> estimate, as I'm unsure what the exact internal representation of a list
> of strings is).  The results were peculiar:
> 
>                   VSZ    /  RSS (used)
> lxml           :  26952  /  26211   (unused:     5)
> bs             :  83408  /  82156   (unused:     0)
> html5_cet      :  55640  /  54745   (unused:    19)
> html5_et       :  65712  /  64946   (unused:    14)
> html5_lxml     :  50072  /  48986   (unused:   134)
> html5_minidom  : 195372  / 192914   (unused:    14)
> html5_simple   :  99772  /  97999   (unused:    17)
> lxml_bs        : 104644  /  73037   (unused: 31783)
> htmlparser     :   4448  /   4433   (unused:    19)
> 
> I guess I'm not surprised that lxml_bs (lxml.html.ElementSoup) has lots
> of free memory left over at the end.  I am surprised that the others
> don't, at least html5_lxml should be similar I'd think (though I guess
> if you take into account the unused memory then html5_lxml and lxml_bs
> are similar).

That's a somewhat unfair comparison though. lxml (read: libxml2) doesn't use
Python's memory management, so memory that is freed by the parser is really
freed to the OS, not just left as a growing interpreter heap.

I think that's the main reason why html5_lxml ends up below html5_cet in your
test. (Please correct me :)


> I don't actually know if BS is better than lxml in parsing... anything.

When I tried it on the generated libxml2 HTML documentation (2.5 MB), BS
crashed with an encoding error, while lxml worked just fine. But you might
argue that libxml2 should be able to parse its own documentation. ;)


> I haven't looked hard (yet, at least).  The example on the ElementSoup
> page parses *slightly* better with BS, but lxml parses it very similarly
> to how html5lib parses it, which I'd consider the better standard.
> html5lib has the advantage of being a kind of standard.
> 
> If I had a good collection of crappy HTML, that would probably be an
> interesting test to see how differently html5lib, BS, and lxml parse it.
> I'm not sure where to find a good collection like that.  Maybe
> html5lib's tests, I guess.

There seem to be a fair number of HTML browser compliance test suites on the
web, but I didn't find any test suites for broken HTML at first glance.


>> ElementSoup parser (i.e. BS on lxml) works, is: parse the document with BS,
>> and then recursively translate the tree into an lxml tree. So you
>> temporarily use about twice the memory. You'd have to intercept the tree
>> builder process at the end (before releasing the BS tree) and measure there
>> in order to get the maximum amount of memory used. I'd run it a couple of
>> times and just watch top while it's running. That way, you can figure out
>> something close to the maximum yourself.
> 
> I'm pretty sure what you end up with afterwards is the maximum use, as
> Python doesn't release memory back to the operating system after it has
> allocated it.  (Or at least Python 2.4 doesn't.)  So instead you have a
> pool of memory that Python isn't using, but the OS doesn't know that.  I
> guess the assumption is that if Python never needs to use it again, at
> least the OS can move it to virtual memory.

Again, unfair advantage for lxml.
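
For reference, the two-step approach quoted above (parse into one tree, then
recursively copy it into a second one) can be sketched roughly as follows.
This is not the ElementSoup code: the nested (tag, attributes, children)
tuples are a hypothetical stand-in for the BS tree, and the standard
library's ElementTree stands in for lxml so the sketch runs without either
package. The point is only that both trees are alive until the copy finishes,
which is where the temporary doubling of memory comes from.

```python
import xml.etree.ElementTree as ET

def translate(node, parent=None):
    """Recursively copy a (tag, attrib, children) tuple into an Element tree.

    The tuple layout is an assumption made for this sketch; children are
    either plain strings (text) or nested tuples (child elements).
    """
    tag, attrib, children = node
    if parent is None:
        elem = ET.Element(tag, attrib)
    else:
        elem = ET.SubElement(parent, tag, attrib)
    last_child = None
    for child in children:
        if isinstance(child, str):
            # Text before the first child element goes to .text,
            # text after a child element goes to that child's .tail.
            if last_child is None:
                elem.text = (elem.text or "") + child
            else:
                last_child.tail = (last_child.tail or "") + child
        else:
            last_child = translate(child, elem)
    return elem

# The source tree stays referenced while the copy is built, so both
# representations exist in memory at the same time.
source = ("p", {"class": "x"}, ["Hello ", ("b", {}, ["world"]), "!"])
root = translate(source)
```

Measuring right after translate() returns, while "source" is still
referenced, would be the place to catch the combined footprint.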

What about running a shell script in parallel to the parser tests that dumps
the program's current RAM usage to a file as fast as it can? Then run the file
through "sort -n -r | head -1" to get the peak and use that.
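
The same idea can be sketched in Python instead of shell. This is only a
rough illustration and assumes a Linux system where /proc/<pid>/status
carries a "VmRSS:" line; the function names, the polling interval and the
duration are made up for the example. Instead of writing samples to a file
and sorting them, it keeps the running maximum directly.

```python
import re
import time

def parse_vmrss_kb(status_text):
    """Return the VmRSS value (kB) found in /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None

def sample_peak_rss_kb(pid, duration=1.0, interval=0.01):
    """Poll /proc/<pid>/status and return the largest RSS seen (kB), or None.

    Assumes Linux; polling can miss spikes shorter than the interval.
    """
    peak = None
    deadline = time.time() + duration
    while time.time() < deadline:
        try:
            with open("/proc/%d/status" % pid) as status_file:
                rss = parse_vmrss_kb(status_file.read())
        except (IOError, OSError):
            break  # process has exited, or /proc is not available
        if rss is not None and (peak is None or rss > peak):
            peak = rss
        time.sleep(interval)
    return peak
```

You would start the parser test as a separate process and call
sample_peak_rss_kb() with its PID while it runs. On Linux the same status
file usually also has a "VmHWM:" line (the kernel's own record of peak RSS),
which avoids the sampling gap altogether.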


>> On the other hand, I don't know if temporary memory is of that much value
>> for a comparison. If it takes more space while parsing - so what? You'll
>> likely keep the document tree in memory much longer than the parsing takes,
>> so that's the dominating factor.
> 
> Right, I'm more interested in the memory the finished document takes.
> Intermediate memory use shows up in the performance numbers anyway.
> Though I don't know if all that memory use might also lead to
> fragmentation, slowing down later allocations?  This is beyond my
> understanding of Python performance.

My guess is that there is enough memory overhead involved in a dynamic
language like Python to keep the impact of memory fragmentation on the parser
performance rather low in comparison. But that's just a guess.

Stefan

