[lxml-dev] Dealing with segfaults in lxml?

Stefan Behnel stefan_ml at behnel.de
Wed Oct 10 09:03:04 CEST 2007


Mike Meyer wrote:
> On Tue, 09 Oct 2007 12:39:19 +0200 Stefan Behnel <stefan_ml at behnel.de> wrote:
>> Can you find out if it's the iterparse() or something else that fails here?
> 
> Well, I did try isolating parts of the parsing process. The problem
> appears to be in the attribute extraction code.
> 
> Basically, I have a routine that I pass an xpath expression to, and a
> list of attributes I want values for from those elements. I was being
> clever (probably too clever), and letting lxml provide a dictionary,
> using dict to make a copy of it (i.e. - the "d = dict(node.attrib)"
> line),

That should work though. You should also be able to safely do

    d = dict(node.items())

or something along those lines, which should even be faster as it avoids the
intermediate attrib proxy and iterator creation steps. If you want to be more
selective, a generator expression will do.
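
A minimal sketch of what that could look like. The element and the "wanted"
key list are made up for illustration (they are not taken from Mike's code),
and the fallback import only serves to keep the snippet runnable where lxml
is not installed, since ElementTree offers the same items() and get() calls:

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib ElementTree is close enough for this example.
    import xml.etree.ElementTree as etree

# Hypothetical element and key list, for illustration only.
node = etree.fromstring('<item id="42" name="foo" extra="x"/>')
wanted = ("id", "name", "missing")

# Copy all attributes without going through the .attrib proxy.
all_attrs = dict(node.items())

# Selective copy: keep only the attributes we asked for and that exist.
selected = dict((k, v) for k, v in node.items() if k in wanted)

# Selective copy with an empty string for every missing attribute.
with_defaults = dict((k, node.get(k, "")) for k in wanted)
```

The last variant gives the same result as the explicit loop over the keys
quoted further down, just written as a single expression.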


> and then playing games with sets to remove extra keys and add
> empty strings for missing attributes. If I just create an empty
> dictionary and plug empty strings into it for all the keys, the
> problem goes away.
> 
> So I rewrote that code with something a bit more straightforward:
> 
>    d = dict()
>    for key in keys:
>        d[key] = node.get(key, '')
> 
> and again, I haven't been able to recreate the problem.

Hmmmm, this sounds like a deallocation problem then. Calling .attrib creates a
dict-like Proxy that adds a cyclic reference to the underlying Element, so
this changes the garbage collection behaviour. This area has gone wrong a
couple of times before, as it is really hard to get right for the many
possible use cases (involving threading race conditions and whatnot). I was
pretty sure, though, that 1.3.2+ no longer suffered from anything like that,
and the attrib stuff should actually have been fixed in 1.2 already, AFAIR.


>> Using valgrind is usually a great way to find out what's going wrong. It will
>> make the run a lot slower, but it should print some helpful information when it
>> crashes. Run it like this:
>>
>> valgrind --tool=memcheck --leak-check=no --suppressions=valgrind-python.supp \
>>     python yourscript.py
>>
>> preferably only on the process that crashes.
> 
> I've got this. I get errors from the Python parser and oracle
> libraries (uninitialized values). Then errors from lxml that look like
> the gdb "where" output: it just points through etree.so, but adds that
> it's doing an invalid read of size 8 or 4 (didn't have the size
> before, but this should cause the segfaults). These all seem to be
> followed by an error that says
>     Address 0x4D31450 is 8 bytes inside a block of size 120 free'd
> And then traces back through vg_replace_malloc.c, then xmlFreeNodeList
> in libxml2 a couple of times, and then back to etree.so.

Then it is a deallocation problem. Apparently, the XML nodes it accesses were
already freed before - that's what's great about valgrind: it tells you what
last happened to the memory that it now fails to access, so you can figure out
why it was freed in the first place.

Could you send me the output?

Stefan

