[lxml-dev] Dealing with segfaults in lxml?
Stefan Behnel
stefan_ml at behnel.de
Tue Oct 9 12:39:19 CEST 2007
Hi,
ok, that wasn't the problem then (it's still good to have it fixed, though).
Mike Meyer wrote:
> A master process reads in a couple of config files, and parses and
> checks them against a schema, and then possibly plugs in some default
> attribute values. It then forks two processes:
>
> 1) Uses http to get xml documents from a remote server. These are the
> ones I described; they have a meta element and then a data element
> containing "row" elements, with the actual values in the attributes
> of the "row" elements. This process uses iterparse to pull one value
> from the meta element, and then saves the entire thing to disk.
That's the process that fails, right?
Can you find out if it's the iterparse() or something else that fails here?
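
To check the iterparse() step in isolation, something like the following could be run against one of the saved documents. This is only a sketch: the element names "meta", "data" and "row" come from your description, while the "id" attribute, the sample document and the meta_value() helper are made up for illustration.

```python
from io import BytesIO
from lxml import etree

# Hypothetical sample document; the real ones come from the remote server.
doc = b'<doc><meta id="42"/><data><row value="1"/><row value="2"/></data></doc>'

def meta_value(source):
    """Return the (assumed) 'id' attribute of the first <meta> element."""
    for event, elem in etree.iterparse(source, events=('end',)):
        if elem.tag == 'meta':
            # Stop as soon as the value has been found.
            return elem.get('id')
    return None

value = meta_value(BytesIO(doc))
```

If a loop like this survives a few thousand documents on its own, the crash is more likely to come from some other part of that process.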
Using valgrind is usually a great way to find out what's going wrong. It will
make the run a lot slower, but it should print some helpful information when it
crashes. Run it like this:
valgrind --tool=memcheck --leak-check=no --suppressions=valgrind-python.supp \
python yourscript.py
preferably only on the process that crashes.
> I.e. - the only document that gets reused a lot is the schema, which
> is built, passed to RelaxNG, and then used to validate each of those
> thousands of documents.
That's ok.
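
For reference, the pattern I understand you to be using looks roughly like this. The schema and the two documents are invented for the example; only the RelaxNG class and its validate() method are taken from lxml.

```python
from io import BytesIO
from lxml import etree

# Hypothetical RelaxNG schema, for illustration only.
RNG = b"""<element name="doc" xmlns="http://relaxng.org/ns/structure/1.0">
  <zeroOrMore>
    <element name="row">
      <attribute name="value"/>
    </element>
  </zeroOrMore>
</element>"""

# Build the validator once ...
schema = etree.RelaxNG(etree.parse(BytesIO(RNG)))

# ... and reuse it for every incoming document.
documents = [b'<doc><row value="1"/></doc>', b'<doc><bogus/></doc>']
results = [schema.validate(etree.fromstring(d)) for d in documents]
```

Reusing a single validator object this way is the intended usage.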
> architecture makes things a little convoluted, but the basic path is
> something like:
>
> data = urlopen(....)
> try:
>     parsed = fromstring(data.read())
parse(data) should do, BTW.
>     if not schema.validate(parsed):
>         handle_broken_document(parsed=parsed)
>     for node in parsed.xpath('Types/Type'):
>         d = dict(node.attrib)
>         save_for_db(d)
>     for node in parsed.xpath('AltTypes/AltType'):
>         d = dict(node.attrib)
>         save_for_db(d)
>     for node in parsed.xpath('MoreTypes/MoreType'):
>         d = dict(node.attrib)
>         save_for_db(d)
That's pretty straightforward code; I don't see any risk here. But I'm
wondering which of the two processes actually fails now - you're presenting
this one, but from your previous posts I thought it was the other one that crashed.
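
Regarding parse(): it accepts any file-like object, so the intermediate read() is not needed. A minimal sketch, with a BytesIO standing in for the urlopen() result and made-up attribute names:

```python
from io import BytesIO
from lxml import etree

# Stand-in for the object returned by urlopen(); the content is hypothetical.
data = BytesIO(
    b'<root><Types><Type id="1" name="a"/><Type id="2" name="b"/></Types></root>'
)

tree = etree.parse(data)   # returns an ElementTree, not an Element
root = tree.getroot()      # the Element that fromstring() would have returned

# Same extraction as in the quoted loop, collected into a list.
saved = [dict(node.attrib) for node in root.xpath('Types/Type')]
```

Note that parse() returns an ElementTree, so call getroot() wherever the rest of the code expects the root element.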
> I tried turning off the parsing - which pretty much makes everything
> else do nothing but pass around the raw data - and got no failures. I
> also tried turning off just the validation, so that the work is still
> getting done - and got failures.
Hmmmm, are those failures related to validation errors?
Just in case it's the second process that fails (the XPath one), it could be
worth testing if using the XPath() class instead of the xpath() method works
better. That might give us a hint on where the problem comes from. It should
also be faster, BTW.
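
A sketch of what I mean, assuming the same path expressions as in your code; the sample document and the find_types name are only for illustration:

```python
from lxml import etree

# Compile the expression once, outside the per-document loop.
find_types = etree.XPath('Types/Type')

# Hypothetical sample document.
root = etree.fromstring(b'<root><Types><Type id="1"/></Types></root>')

# Calling the compiled object evaluates it against the given element.
rows = [dict(node.attrib) for node in find_types(root)]
```

The compiled object can be reused across all documents, whereas the xpath() method takes the expression as a string on every call.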
Stefan