[lxml-dev] Some benchmarks

Stefan Behnel stefan_ml at behnel.de
Mon Mar 10 12:01:41 CET 2008


Hi Ian,

Ian Bicking wrote:
> For the curious, I've attached some benchmarks.  These are preliminary,
> I'm putting together the numbers for my HTML talk at PyCon.

Those /are/ pretty impressive numbers. Go, get some lxml ads up at PyCon. :)


> One thing that I'd like to test is the memory use for documents.  To do
> this I'm parsing about 4.5Mb of documents and keeping them in memory,
> and looking at the VSZ/RSS sizes reported by ps before and after.  I
> don't think this is the right/best way to do this.  For instance,
> transient memory use by some parsers makes Python grab a bunch of
> memory, but it might be free after parsing, and usable for other things.
>  Also, I don't know if VSZ/RSS is valid at all.  I get the impression it
> isn't that valid.  And the increases I'm seeing for lxml don't seem to
> be sufficient; at least the process should grow by 4.5Mb, right? lxml
> can't be that much more efficient than the serialized form of these files.

:) Didn't you see the code snippet in lxml's parser that sneaks all documents
into dark memory?

I noticed that you calculate the initial size /after/ parsing in the
--serialize case. If I move it before that, I get reasonable numbers for lxml:
+17MB for 2.5MB of documents on a 32-bit machine.
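
A minimal sketch of that ordering, not the benchmark script itself: take the
baseline before parsing, keep every tree alive, then read the size again. It
assumes Linux (it reads /proc/self/statm), and the names rss_bytes and
measure_growth as well as the synthetic input are made up for illustration.
The stdlib ElementTree stands in for the parser here; lxml.etree.fromstring
can be passed in the same way.

```python
import os
import xml.etree.ElementTree as ET  # stand-in parser for this sketch


def rss_bytes():
    """Current resident set size of this process (Linux only, via /proc)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def measure_growth(parse, sources):
    """Return (trees, growth_in_bytes); the baseline is taken BEFORE parsing."""
    before = rss_bytes()
    trees = [parse(src) for src in sources]  # keep every tree alive
    after = rss_bytes()
    return trees, after - before


# Hypothetical input: 200 copies of a small synthetic document.
sources = ["<html><body>" + "<p>text</p>" * 500 + "</body></html>"] * 200
trees, growth = measure_growth(ET.fromstring, sources)
print("kept %d trees, RSS grew by %.1f MB" % (len(trees), growth / 1e6))
```

The number still includes whatever the allocator grabbed and has not handed
back, so it is an upper bound on what the trees themselves occupy.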

I don't mind having a bit of setup-time memory in those numbers, as the
absolute numbers are dominated by the document size. They very much depend on
your specific documents anyway (amount of text vs. tags, for example). So if
two libraries are close here, either of them might win for a specific input.
And if they are far apart, well, then it's obvious enough which is better. A
meg more or less doesn't matter.


> Another clear indication that we're measuring transient stuff is that
> when using the BeautifulSoup or html5 parser with an lxml document the
> memory increases substantially.  So any ideas on how to test memory
> would be much appreciated.

That's somewhat hard to do across libraries. For example, the ElementSoup
parser (i.e. BS on lxml) works like this: it parses the document with BS and
then recursively translates the tree into an lxml tree. So you temporarily use about
twice the memory. You'd have to intercept the tree builder process at the end
(before releasing the BS tree) and measure there in order to get the maximum
amount of memory used. I'd run it a couple of times and just watch top while
it's running. That way, you can figure out something close to the maximum
yourself.
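
To illustrate the difference between peak and retained memory in such a
two-stage parse, here is a rough sketch using only the standard library. The
helpers build_intermediate and translate are invented for this example and
are not how ElementSoup is implemented. Note also that tracemalloc only sees
allocations made through Python's allocators, so it would miss the memory
libxml2 allocates for an lxml tree; for that case, the peak RSS reported by
resource.getrusage() (ru_maxrss) is closer to what top shows.

```python
import tracemalloc
import xml.etree.ElementTree as ET


def build_intermediate(n):
    """Hypothetical first-stage tree: a list of (tag, text) tuples."""
    return [("p", "paragraph %d" % i) for i in range(n)]


def translate(intermediate):
    """Second stage: copy the intermediate tree into an ElementTree tree."""
    root = ET.Element("body")
    for tag, text in intermediate:
        ET.SubElement(root, tag).text = text
    return root


tracemalloc.start()
intermediate = build_intermediate(20000)
root = translate(intermediate)   # both trees are alive at this point
del intermediate                 # release the first-stage tree
retained, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print("retained: %.1f MB, peak: %.1f MB" % (retained / 1e6, peak / 1e6))
```

The peak covers the moment when both trees exist, while the retained figure
is what stays around once the first-stage tree is gone.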

On the other hand, I don't know if temporary memory is of that much value for
a comparison. If it takes more space while parsing - so what? You'll likely
keep the document tree in memory much longer than the parsing takes, so that's
the dominating factor.

Stefan


