[lxml-dev] One-time memory leak?
Stefan Behnel
stefan_ml at behnel.de
Thu Feb 14 22:02:49 CET 2008
Hi,
Marius Gedminas wrote:
> I've been using libxml2 (before lxml was even created) and I've built
> some infrastructure for catching libxml2 memory leaks in my unit tests.
> Recently I've started using lxml on a completely different project and
> noticed that my old leak watcher was hooked up -- because it reported a
> leak.
>
> This is most likely a false positive (the "leak" happens only once during the
> program's lifetime), but I'd like to understand what exactly happens. I'm
> attaching a short test program that produces this output on my machine:
>
> $ bin/python lxml-memleak.py
> test_libxml2_html: leaked 0 bytes
> test_libxml2_xml: leaked 0 bytes
> test_lxml_html: leaked 9423 bytes
> test_lxml_xml: leaked 9479 bytes
>
> This is in a virtualenv sandbox with lxml 2.0.1 from cheeseshop and
> system-wide libxml2 2.6.30 (plus a security patch or two) from Ubuntu
> Gutsy. Each of those tests was run in a separate Python process to
> avoid contamination.
You're not testing the same thing, though. lxml does all sorts of stuff when
you call etree.HTML(), not just a plain call to the parser.
Also, I have no idea what happens when you use lxml and libxml2 together -
which you still do here, as you call into libxml2 to enable leak debugging.
> Note that if I run the same test more than once, I see no new leaks:
>
> $ bin/python lxml-memleak.py test_lxml_html 3
> test_lxml_html: leaked 9423 bytes
> test_lxml_html: leaked 0 bytes
> test_lxml_html: leaked 0 bytes
>
> which leads me to think this "leak" is in fact harmless on-demand
> initialization of some sort.
That may be so - but I wouldn't sign it without further investigation. :)
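
A minimal sketch of how such an investigation could start: measure the growth
in allocated bytes around each of several runs and only treat the result as
one-time initialization if every run after the first is clean. The names
leak_per_run, classify, and FakeLibrary are hypothetical, and get_allocated is
an assumed zero-argument probe (for example a wrapper around a libxml2
memory-debugging call). FakeLibrary only simulates the pattern reported above.

```python
def leak_per_run(func, get_allocated, runs=3):
    """Return the change in allocated bytes observed around each call to func.

    get_allocated is a hypothetical probe that returns the number of bytes
    currently allocated by the library under test.
    """
    leaks = []
    for _ in range(runs):
        before = get_allocated()
        func()
        leaks.append(get_allocated() - before)
    return leaks


def classify(leaks):
    """Give a rough label to a series of per-run deltas."""
    if not any(leaks):
        return "clean"
    if len(leaks) < 2:
        return "inconclusive"  # a single run cannot tell the cases apart
    if not any(leaks[1:]):
        return "one-time allocation"
    return "possible leak"


class FakeLibrary:
    """Stand-in for a library that allocates some state on first use only."""

    def __init__(self):
        self.allocated = 0
        self._initialised = False

    def parse(self):
        if not self._initialised:
            self.allocated += 9423  # simulated lazy initialization
            self._initialised = True
        self.allocated += 100  # memory for the parsed document
        self.allocated -= 100  # freed again when the document goes away


lib = FakeLibrary()
deltas = leak_per_run(lib.parse, lambda: lib.allocated)
print(deltas, classify(deltas))
```

A "one-time allocation" label only says the growth does not repeat. It does
not show where the memory went, so it narrows the question without answering
it.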
> I've tried looking at the lxml source code but gave up in about 30
> seconds. I don't know Cython. I can't tell which is generated code and
> which is the source for that.
The .pyx and .pxi files are what you want to look at first (maybe I should
write up some "how to read the source" docs...)
And don't be afraid of Cython, it's a lot like Python, and there are some
editors (and some ways of life like Emacs) that can display it with colourful
syntax highlighting.
> I cannot find the entry point that would
> let me trace how lxml.etree.HTML() is implemented ("HTML" is a pretty
> ungreppable string).
There is a file called "lxml.etree.pyx", which is the main module. It contains
the main API implementation. However, the HTML() function will quickly jump
into "_parseMemoryDocument(...)", which is implemented at the end of the
"parser.pxi" file. The call chain then continues down to
_BaseParser._parseDoc(), where the actual parsing step is implemented.
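
For names that are hard to grep, a small script that only matches definition
lines in the .pyx and .pxi files may help. This is a rough sketch: the
function name find_definitions is made up, and the regular expression is a
simplification that only covers one-line def/cdef/cpdef/class statements.

```python
import os
import re


def find_definitions(root, name, extensions=(".pyx", ".pxi")):
    """Return (path, line number, line) for lines that appear to define name."""
    # Match "def name(", "cdef <type> name(", "cdef class name:" and similar,
    # but not plain uses of the name or longer names that merely contain it.
    pattern = re.compile(
        r"^\s*(?:def|cdef|cpdef|class)\b[^(:\n]*\b" + re.escape(name) + r"\s*[(:]"
    )
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # walk subdirectories in a stable order
        for filename in sorted(filenames):
            if not filename.endswith(extensions):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8", errors="replace") as source:
                for lineno, line in enumerate(source, 1):
                    if pattern.match(line):
                        hits.append((path, lineno, line.strip()))
    return hits
```

Called as find_definitions("src/lxml", "HTML") (the path is an assumption
about where the sources are checked out), this should skip comments, call
sites, and names such as HTMLParser that only start with the search string.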
> ltrace'ing a Python process failed to notice any
> dynamic library calls to libxml2's functions.
Just guessing, but maybe that's because it needs to trace calls from a library
dynamically loaded by Python?
> How can I translate the short lxml code snippet
>
> from lxml.etree import HTML
> doc = HTML(sample_document)
> del doc
>
> to low-level libxml2 library function calls and see where it allocates
> the extra memory?
Hmmm, that's three lines of Python, but there really is a lot happening behind
the scenes, so that's harder to answer than you might think. I don't know if
you noticed, but parsers can do a lot of weird stuff in lxml, and whenever you
parse a byte string in any of your threads, it will end up in
_BaseParser._parseDoc() in one way or another...
I guess it would actually be easiest if you could get ltrace to work with a
Python extension module...
> I could, of course, declare lxml to be leak-free and just disable my
> leak finder, but I cannot resist the opportunity to make sure of it (and
> for that I need a leak detector without false positives).
I think that's a really good idea. Normally, I use valgrind for leak debugging,
but having something that you could switch on and off around a unit test would
be just perfect.
I'd say the best way would be to add a debugging module to lxml that would
just call into the libxml2 debugging API.
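
As a rough sketch of the "switch on and off around a unit test" idea, a
context manager could compare an allocation counter before and after the code
under test. Everything named here (leak_check, LeakDetected, the get_allocated
and cleanup parameters) is hypothetical and not part of lxml. With the libxml2
Python bindings, the probe might be lambda: libxml2.debugMemory(1), but
whether that interacts safely with lxml is exactly the open question raised
above.

```python
import contextlib


class LeakDetected(AssertionError):
    """Raised when the allocation counter grew across the checked block."""


@contextlib.contextmanager
def leak_check(get_allocated, cleanup=None, tolerance=0):
    """Fail if get_allocated() grows by more than tolerance bytes.

    get_allocated: hypothetical zero-argument probe returning bytes in use.
    cleanup: optional callable run after the block and before measuring.
    """
    before = get_allocated()
    yield
    # Only reached if the block itself did not raise.
    if cleanup is not None:
        cleanup()
    leaked = get_allocated() - before
    if leaked > tolerance:
        raise LeakDetected("leaked %d bytes" % leaked)


# Usage with a fake counter standing in for the real probe.
counter = {"bytes": 0}

with leak_check(lambda: counter["bytes"]):
    counter["bytes"] += 64   # allocate
    counter["bytes"] -= 64   # free again, so the check passes
```

Because LeakDetected derives from AssertionError, a test runner would report
it as an ordinary test failure. The tolerance parameter is one way to absorb a
known one-time allocation on the first run.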
Stefan