[lxml-dev] Some benchmarks
Ian Bicking
ianb at colorstudy.com
Mon Mar 10 20:12:55 CET 2008
Stefan Behnel wrote:
> I noticed that you calculate the initial size /after/ parsing in the
> --serialize case. If I move it before that, I get reasonable numbers for lxml:
> +17M for 2.5MB of documents on a 32bit machine.
I didn't intend to include the --serialize option, but I must have done
so. Though I don't know why they weren't *all* messed up then? Anyway,
I get 25MB, which seems quite reasonable. Here are the revised numbers:
                   VSZ /    RSS
lxml          :  25908 /  26232
bs            :  82508 /  82168
html5_cet     :  54616 /  54760
html5_et      :  64688 /  64964
html5_lxml    :  49056 /  49124
html5_minidom : 194352 / 192936
html5_simple  :  99772 /  98016
lxml_bs       : 104916 / 104856
htmlparser    :   4440 /   4448
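The post doesn't say how the VSZ and RSS figures were collected. As a minimal sketch, assuming a Linux system where /proc/self/status is available, a process could read its own numbers like this (the function names are made up for illustration and are not from the benchmark script):

```python
def parse_vm_kb(status_text):
    """Return (VSZ, RSS) in KB from /proc/<pid>/status-style text."""
    fields = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            # The value looks like "   25908 kB"; keep only the number.
            fields[key] = int(value.split()[0])
    return fields["VmSize"], fields["VmRSS"]


def current_vsz_rss_kb():
    """Read VSZ and RSS of the running process (assumes Linux /proc)."""
    with open("/proc/self/status") as f:
        return parse_vm_kb(f.read())
```

Taking one reading before parsing and one after, then subtracting, would give a per-parser delta comparable to the numbers above.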
I also tried allocating random strings until the size increased, to see
if there was lots of allocated but free memory (the unused amount is an
estimate, as I'm unsure what the exact internal representation of a list
of strings is). The results were peculiar:
                   VSZ /    RSS (used)
lxml          :  26952 /  26211 (unused: 5)
bs            :  83408 /  82156 (unused: 0)
html5_cet     :  55640 /  54745 (unused: 19)
html5_et      :  65712 /  64946 (unused: 14)
html5_lxml    :  50072 /  48986 (unused: 134)
html5_minidom : 195372 / 192914 (unused: 14)
html5_simple  :  99772 /  97999 (unused: 17)
lxml_bs       : 104644 /  73037 (unused: 31783)
htmlparser    :   4448 /   4433 (unused: 19)
I guess I'm not surprised that lxml_bs (lxml.html.ElementSoup) has lots
of free memory left over at the end. I am surprised that the others
don't, at least html5_lxml should be similar I'd think (though I guess
if you take into account the unused memory then html5_lxml and lxml_bs
are similar).
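The "allocate random strings until the size increases" probe could be sketched roughly as follows. This is not the script used for the numbers above; estimate_unused_kb and its read_rss_kb parameter are hypothetical names, and the result ignores per-string object overhead, so it is only a rough estimate.

```python
import random
import string


def random_string(n):
    """Build a random string of n ASCII letters."""
    return "".join(random.choice(string.ascii_letters) for _ in range(n))


def estimate_unused_kb(read_rss_kb, chunk_chars=1024, max_chunks=100000):
    """Allocate strings until RSS grows past its starting value.

    read_rss_kb is any callable returning the current RSS in KB.
    Returns the approximate KB of string data that fit into memory
    the process already held before the OS-visible size went up.
    """
    baseline = read_rss_kb()
    held = []  # keep references so the strings are not freed
    while len(held) < max_chunks and read_rss_kb() <= baseline:
        held.append(random_string(chunk_chars))
    return len(held) * chunk_chars // 1024
```

In practice read_rss_kb would be whatever reports the process's resident size, for example a wrapper around ps or /proc. Passing it in as a callable keeps the probe independent of how the measurement is taken.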
I don't actually know if BS is better than lxml in parsing... anything.
I haven't looked hard (yet, at least). The example on the ElementSoup
page parses *slightly* better with BS, but lxml parses it very similarly
to how html5lib parses it, which I'd consider the better reference, since
html5lib has the advantage of being a kind of standard.
If I had a good collection of crappy HTML, that would probably be an
interesting test to see how differently html5lib, BS, and lxml parse it.
I'm not sure where to find a good collection like that. Maybe
html5lib's tests, I guess.
>> Another clear indication that we're measuring transient stuff is that
>> when using the BeautifulSoup or html5 parser with an lxml document the
>> memory increases substantially. So any ideas on how to test memory
>> would be much appreciated.
>
> Somewhat hard to do across libraries. For example, the way the ElementSoup
> parser (i.e. BS on lxml) works, is: parse the document with BS, and then
> recursively translate the tree into an lxml tree. So you temporarily use about
> twice the memory. You'd have to intercept the tree builder process at the end
> (before releasing the BS tree) and measure there in order to get the maximum
> amount of memory used. I'd run it a couple of times and just watch top while
> it's running. That way, you can figure out something close to the maximum
> yourself.
I'm pretty sure what you end up with afterwards is the maximum use, as Python
doesn't release memory back to the operating system after it has allocated
it. (Or at least Python 2.4 doesn't.) So instead you have a pool of
memory that Python isn't using, but the OS doesn't know that. I guess
the assumption is that if Python never needs to use it again, at least
the OS can move it to virtual memory.
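If the high-water mark itself is of interest, one way to check it without watching top is to ask the OS for the peak resident size. This is a minimal sketch using the standard library's resource module (Unix only); peak_rss_kb is an illustrative name, and the unit handling assumes Linux reports kilobytes while macOS reports bytes.

```python
import resource
import sys


def peak_rss_kb():
    """Return the peak resident set size of this process in KB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is in kilobytes on Linux but in bytes on macOS.
    return peak // 1024 if sys.platform == "darwin" else peak
```

Comparing this value with the current RSS after parsing would show whether the final size really is the maximum, as assumed above.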
> On the other hand, I don't know if temporary memory is of that much value for
> a comparison. If it takes more space while parsing - so what? You'll likely
> keep the document tree in memory much longer than the parsing takes, so that's
> the dominating factor.
Right, I'm more interested in the memory the finished document takes.
Intermediate memory use shows up in the performance numbers anyway.
Though I don't know if all that memory use might also lead to
fragmentation, slowing down later allocations? This is beyond my
understanding of Python performance.
Ian