[lxml-dev] One-time memory leak?

Stefan Behnel stefan_ml at behnel.de
Thu Feb 14 22:02:49 CET 2008


Hi,

Marius Gedminas wrote:
> I've been using libxml2 (before lxml was even created) and I've built
> some infrastructure for catching libxml2 memory leaks in my unit tests.
> Recently I've started using lxml on a completely different project and
> noticed that my old leak watcher was hooked up -- because it reported a
> leak.
>
> This is most likely a false positive (the "leak" happens only once during the
> program's lifetime), but I'd like to understand what exactly happens.  I'm
> attaching a short test program that produces this output on my machine:
> 
>     $ bin/python lxml-memleak.py
>     test_libxml2_html: leaked 0 bytes
>     test_libxml2_xml: leaked 0 bytes
>     test_lxml_html: leaked 9423 bytes
>     test_lxml_xml: leaked 9479 bytes
> 
> This is in a virtualenv sandbox with lxml 2.0.1 from cheeseshop and
> system-wide libxml2 2.6.30 (plus a security patch or two) from Ubuntu
> Gutsy.  Each of those tests was run in a separate Python process to
> avoid contamination.

You're not testing the same thing, though. lxml does all sorts of stuff when
you call etree.HTML(), not just a plain call to the parser.

Also, I have no idea what happens when you use lxml and libxml2 together -
which you still do here, as you call into libxml2 to enable leak debugging.


> Note that if I run the same test more than once, I see no new leaks:
> 
>     $ bin/python lxml-memleak.py test_lxml_html 3
>     test_lxml_html: leaked 9423 bytes
>     test_lxml_html: leaked 0 bytes
>     test_lxml_html: leaked 0 bytes
> 
> which leads me to think this "leak" is in fact harmless on-demand
> initialization of some sort.

That may well be so, but I wouldn't sign off on it without further
investigation. :)
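
One way to start that investigation is to repeat the test and compare the
per-run figures: a cost that appears only in the first run looks like one-time
initialization, while a cost that recurs looks like a real leak. A minimal
sketch of that check follows. The helper name classify_leaks is made up for
illustration, and the input is the per-run figures quoted above.

```python
def classify_leaks(deltas):
    """Classify leaked-byte counts from repeated runs of one test.

    deltas: list of ints, the bytes reported as leaked after each run,
            in the order the runs happened.
    """
    if not deltas:
        raise ValueError("need at least one run")
    if all(d == 0 for d in deltas):
        return "clean"
    if len(deltas) == 1:
        # A single non-zero run cannot tell the two cases apart.
        return "inconclusive"
    if deltas[0] > 0 and all(d == 0 for d in deltas[1:]):
        # Only the first run cost anything.
        return "one-time"
    return "recurring"


# Figures from the three runs of test_lxml_html quoted above.
print(classify_leaks([9423, 0, 0]))
```

A "one-time" result only says the cost did not repeat. It does not say where
the memory went or whether it is ever freed.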


> I've tried looking at the lxml source code but gave up in about 30
> seconds.  I don't know Cython.  I can't tell which is generated code and
> which is the source for that.

The .pyx and .pxi files are what you want to look at first (maybe I should
write up some "how to read the source" docs...)

And don't be afraid of Cython: it's a lot like Python, and there are some
editors (and some ways of life, like Emacs) that can display it with colourful
syntax highlighting.


> I cannot find the entry point that would
> let me trace how lxml.etree.HTML() is implemented ("HTML" is a pretty
> ungreppable string).

There is a file called "lxml.etree.pyx", which is the main module. It contains
the main API implementation. However, the HTML() function will quickly jump
into "_parseMemoryDocument(...)", which is implemented at the end of the
"parser.pxi" file. The call chain then continues down to a call to
_BaseParser._parseDoc(), where the actual parsing step is implemented.
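
Since "HTML" is hard to grep for, it may help to search only the hand-written
sources and only lines that look like definitions. The following is a rough
sketch, not part of lxml: find_definitions is a hypothetical helper, and the
regular expression is a heuristic that can miss unusual declarations.

```python
import re
from pathlib import Path


def find_definitions(src_dir, name):
    """Yield (path, line_number, line) for def/cdef/cpdef lines naming `name`.

    Only .pyx and .pxi files are searched, so generated C files are skipped.
    """
    pattern = re.compile(
        r"^\s*(?:def|cdef|cpdef)\b.*\b" + re.escape(name) + r"\s*\("
    )
    for path in sorted(Path(src_dir).rglob("*")):
        if not path.is_file() or path.suffix not in (".pyx", ".pxi"):
            continue
        with path.open(encoding="utf-8", errors="replace") as f:
            for lineno, line in enumerate(f, 1):
                if pattern.match(line):
                    yield path, lineno, line.rstrip()


# Example use (the directory is an assumption about your checkout layout):
#     for path, lineno, line in find_definitions("src/lxml", "HTML"):
#         print("%s:%d: %s" % (path, lineno, line))
```

The same call with "_parseMemoryDocument" or "_parseDoc" as the name follows
the chain described above one step at a time.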


> ltrace'ing a Python process failed to notice any
> dynamic library calls to libxml2's functions.

Just guessing, but maybe that's because the calls come from an extension
module that Python loads dynamically at runtime, rather than from the python
executable itself?


> How can I translate the short lxml code snippet
> 
>     from lxml.etree import HTML
>     doc = HTML(sample_document)
>     del doc
> 
> to low-level libxml2 library function calls and see where it allocates
> the extra memory?

Hmmm, that's three lines of Python, but there really is a lot happening behind
the scenes, so that's harder to answer than you might think. I don't know if
you noticed, but parsers can do a lot of weird stuff in lxml, and whenever you
parse a byte string in any of your threads, it will end up in that function in
one way or another...

I guess it would actually be easiest if you could get ltrace to work with a
Python extension module...


> I could, of course, declare lxml to be leak-free and just disable my
> leak finder, but I cannot resist the opportunity to make sure of it (and
> for that I need a leak detector without false positives).

I think that's a very good idea. Normally, I use valgrind for leak debugging,
but having something that you could switch on and off around a unit test would
be just perfect.

I'd say the best way would be to add a debugging module to lxml that would
just call into the libxml2 debugging API.
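
To make the "switch on and off around a unit test" idea concrete, here is a
rough sketch of what the Python side could look like. Nothing in it exists in
lxml: leak_watch and its parameters are hypothetical, and the measure callable
is an assumption standing in for whatever the debugging module would expose
(with the libxml2 Python bindings, something like libxml2.debugMemory(1), which
returns the number of bytes currently allocated, might serve). The demo uses a
fake counter so that it runs without libxml2.

```python
from contextlib import contextmanager


@contextmanager
def leak_watch(measure, enabled=True, report=print, label="test"):
    """Report how much `measure()` changed across the enclosed block.

    measure: zero-argument callable returning the bytes currently allocated
             (hypothetical hook into the library's memory debugging).
    enabled: set to False to turn the watcher into a no-op.
    """
    if not enabled:
        yield
        return
    before = measure()
    try:
        yield
    finally:
        leaked = measure() - before
        report("%s: leaked %d bytes" % (label, leaked))


# Demo with a fake allocator instead of a real library.
allocated = []   # stand-in for memory the library holds on to
messages = []

with leak_watch(lambda: sum(allocated), report=messages.append, label="demo"):
    allocated.append(9423)   # simulated allocation that is never released

with leak_watch(lambda: sum(allocated), enabled=False, report=messages.append):
    allocated.append(1)      # not reported, the watcher is switched off

print(messages)
```

In a unittest-based suite the same context manager could wrap the body of a
single test method, or be entered in setUp() and left in tearDown().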

Stefan

