[lxml-dev] Dealing with segfaults in lxml?

Mike Meyer mwm-keyword-lxml.9112b8 at mired.org
Tue Oct 9 21:01:56 CEST 2007


On Tue, 09 Oct 2007 12:39:19 +0200 Stefan Behnel <stefan_ml at behnel.de> wrote:

> ok, that wasn't the problem then (it's still good to have it fixed, though).
> Mike Meyer wrote:
> > A master process reads in a couple of config files, and parses and
> > checks them against a schema, and then possibly plugs in some default
> > attribute values. It then forks two processes:
> > 
> > 1) Uses http to get xml documents from a remote server. These are the
> >    ones I described; they have a meta element and then a data element
> >    containing "row" elements, with the actual values in the attributes
> >    of the "row" elements. This process uses iterparse to pull one value
> >    from the meta element, and then saves the entire thing to disk.
> That's the process that fails, right?

No, it's the second process, that uses xpath expressions to find
elements to pull the attribute values from, that fails.

> Can you find out if it's the iterparse() or something else that fails here?

Well, I did try isolating parts of the parsing process. The problem
appears to be in the attribute extraction code.

Basically, I have a routine that I pass an xpath expression to, and a
list of attributes I want values for from those elements. I was being
clever (probably too clever), and letting lxml provide a dictionary,
using dict to make a copy of it (i.e. - the "d = dict(node.attrib)"
line), and then playing games with sets to remove extra keys and add
empty strings for missing attributes. If I just create an empty
dictionary and plug empty strings into it for all the keys, the
problem goes away.
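
A rough sketch of the approach described above, reconstructed for
illustration only (it is not the original code). The helper name
attrs_via_copy and the sample element are hypothetical, and the import
falls back to the standard library's ElementTree, which shares the calls
used here, if lxml isn't installed:

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib ElementTree is close enough for this sketch
    import xml.etree.ElementTree as etree


def attrs_via_copy(node, keys):
    """Copy node.attrib, then trim and pad it so it has exactly `keys`."""
    wanted = set(keys)
    d = dict(node.attrib)            # the "d = dict(node.attrib)" copy
    for extra in set(d) - wanted:    # remove attributes nobody asked for
        del d[extra]
    for missing in wanted - set(d):  # add empty strings for absent attributes
        d[missing] = ''
    return d


# Hypothetical sample element, just to exercise the helper.
node = etree.fromstring(b"<Type name='a' extra='x'/>")
result = attrs_via_copy(node, ['name', 'size'])
```

Here result holds only the requested keys, with an empty string for the
attribute the element lacks.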

So I rewrote that code with something a bit more straightforward:

   d = dict()
   for key in keys:
       d[key] = node.get(key, '')

and again, I haven't been able to recreate the problem.
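
For reference, a minimal self-contained sketch of a routine built around
that loop. The function name extract_attributes, the path, and the sample
document are made up for illustration; the import falls back to the
standard library's ElementTree (same findall/get calls) when lxml isn't
available:

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib ElementTree is close enough for this sketch
    import xml.etree.ElementTree as etree


def extract_attributes(root, path, keys):
    """Return one dict per element matching `path`, holding exactly `keys`."""
    rows = []
    for node in root.findall(path):
        # Build a fresh dict instead of copying node.attrib;
        # attributes missing from the element default to ''.
        rows.append({key: node.get(key, '') for key in keys})
    return rows


# Hypothetical sample document, not real data.
SAMPLE = (b"<doc><Types>"
          b"<Type name='a' size='1' extra='x'/>"
          b"<Type name='b'/>"
          b"</Types></doc>")

root = etree.fromstring(SAMPLE)
rows = extract_attributes(root, 'Types/Type', ['name', 'size'])
```

Each entry in rows only ever contains plain strings pulled out with
get(), so nothing keeps a reference to lxml's attribute mapping.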

The rest of this is probably irrelevant at this point. I've got code
that appears to be working, and things to try if it doesn't work. If
you'd like to continue chasing this, let me know if there's anything I
can do to help.

> Using valgrind is usually a great way to find out what's going wrong. It will
> make the run a lot slower, but it should print some helpful infos when it
> crashes. Run it like this:
> 
> valgrind --tool=memcheck --leak-check=no --suppressions=valgrind-python.supp \
>     python yourscript.py
> 
> preferably only on the process that crashes.

I've got this. I get errors from the Python parser and oracle
libraries (uninitialized values). Then errors from lxml that look like
the gdb "where" output: it just points through etree.so, but adds that
it's doing an invalid read of size 8 or 4 (didn't have the size
before, but this should cause the segfaults). These all seem to be
followed by an error that says
    Address 0x4D31450 is 8 bytes inside a block of size 120 free'd
And then traces back through vg_replace_malloc.c, then xmlFreeNodeList
in libxml2 a couple of times, and then back to etree.so.

> > architecture makes things a little convoluted, but the basic path is
> > something like:
> > 
> > data = urlopen(....)
> > try:
> >     parsed = fromstring(data.read())
> 
> parse(data) should do, BTW.

Yeah, I know. But the urlopen happens in a different process (and
host, for that matter) than the parsing code. That got lost in the
simplification.

Note that I changed this - I'm actually using the "findall" method,
not the "xpath" method, to find the elements of interest. All values
passed to findall are paths as indicated, though.

> >     if not schema.validate(parsed):
> >        handle_broken_document(parsed=parsed)
> >     for node in parsed.findall('Types/Type'):
> >        d = dict(node.attrib)
> >        save_for_db(d)
> >     for node in parsed.findall('AltTypes/AltType'):
> >        d = dict(node.attrib)
> >        save_for_db(d)
> >     for node in parsed.findall('MoreTypes/MoreType'):
> >        d = dict(node.attrib)
> >        save_for_db(d)
> 
> That's pretty straightforward code, I don't see any risk here. But I'm
> wondering which of the two processes actually fails now - you're presenting
> this one, but from your previous posts I thought it was the other one that crashed.
> > I tried turning off the parsing - which pretty much makes everything
> > else do nothing but pass around the raw data - and got no failures. I
> > also tried turning off just the validation, so that the work is still
> > getting done - and got failures.
> Hmmmm, are those failures related to validation errors?

Nope. I have files without validation errors that cause failures,
whereas I haven't caught the one test file that does validate causing
problems.

> Just in case it's the second process that fails (the XPath one), it could be
> worth testing if using the XPath() class instead of the xpath() method works
> better. That might give us a hint on where the problem comes from. It should
> also be faster, BTW.

I should have thought of that myself. Faster is good, so I went ahead
and made this change. Haven't tried it in the dict(node.attrib)
version, though.


-- 
Mike Meyer <mwm at mired.org>		http://www.mired.org/consulting.html
Independent Network/Unix/Perforce consultant, email for more information.

