[lxml-dev] Dealing with segfaults in lxml?
Mike Meyer
mwm-keyword-lxml.9112b8 at mired.org
Wed Oct 3 22:50:16 CEST 2007
I'm getting crashes - by which I mean the python process is
segfaulting and, with some tweaking of GNU/Linux, leaving me a core
file - while using lxml to parse data.
Versions:
OS: RHEL 5
Python: 2.5.1 (custom built).
lxml: 1.3.3
libxml: 2.6.26 (both compiled and built)
libxslt: 1.1.17
[Yes, I know those are a bit out of date, but we had to give our
client host requirements months ago, and those were current at the
time, and changing them is a non-trivial process, and I've already
started on it, but I'd rather not do that if I can avoid it....]
Rebuilding python with OPTS=-g (I set that for the lxml build as
well), I can get a "where" output that points at lxml:
#0  0x00002aaaaf906c3a in rename ()
    from /usr/local/lib/python2.5/site-packages/lxml/etree.so
#1  0x00002aaaaf906be7 in rename ()
    from /usr/local/lib/python2.5/site-packages/lxml/etree.so
#2  0x00002aaaaf8ebdfe in rename ()
    from /usr/local/lib/python2.5/site-packages/lxml/etree.so
#3  0x00002aaaaf966a5c in findOrBuildNodeNs ()
    from /usr/local/lib/python2.5/site-packages/lxml/etree.so
The first problem is that this isn't repeatable. I've got test data
that will make it happen, but I have to feed that data through the
system a few thousand times before it does. This is part of a database
ETL system, parsing data from the XML to load into the database. If I
feed it the exact same data over and over again, it'll work 9999 times
out of ten thousand - but then fail that ten thousandth time with a
segfault.
While this might not seem like a big deal, we're planning on
processing hundreds of thousands of documents a day, so we're talking
about having an instance of the process die tens of times a day. So I
sorta need to fix it.
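
For anyone wanting to reproduce this kind of intermittent crash, here is a
minimal sketch (not the actual ETL code) of a harness that runs the parse
loop in a child interpreter and reports whether the child died from a
signal. The names run_child and WORKLOAD are hypothetical, the workload uses
the stdlib parser as a stand-in for the real lxml code, and it assumes a
modern Python 3 on a POSIX system, where a negative return code means the
child was killed by that signal number.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in workload: parse the same small document N times.
# Swap the body for the real lxml parsing code to exercise the actual path.
WORKLOAD = """
import sys
import xml.etree.ElementTree as ET
data = b'<feed><meta v="1"/><data id="1"/><data id="2"/></feed>'
for _ in range(int(sys.argv[1])):
    root = ET.fromstring(data)
    [dict(el.attrib) for el in root.iter('data')]
"""


def run_child(code, iterations):
    """Run `code` in a fresh interpreter.

    Returns (returncode, signal_name); signal_name is None unless the
    child was terminated by a signal (POSIX only).
    """
    proc = subprocess.run(
        [sys.executable, "-c", code, str(iterations)],
        capture_output=True,
    )
    rc = proc.returncode
    sig = signal.Signals(-rc).name if rc < 0 else None
    return rc, sig


if __name__ == "__main__":
    rc, sig = run_child(WORKLOAD, 10000)
    if sig is not None:
        print("child killed by", sig)
    else:
        print("child exited with status", rc)
```

Running the workload in a separate process keeps the harness itself alive
when the child segfaults, so the failure can be counted and retried.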
The document is straightforward: it starts with a meta element with a
set of attributes, and then has a lot of data elements, all the same
type, all with the same attributes (give or take an optional one), and
I just use document.xpath to find the elements, and then read off
their attribute values to save to a database load file.
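
To make the shape of the processing concrete, here is a rough sketch of
what is described above, not the real code. The element names (feed, meta,
data), the attribute names, and the helper functions are all invented for
illustration, and it is written for a modern Python 3. It uses lxml's
xpath() when lxml is installed and falls back to the stdlib parser
otherwise, purely so the sketch runs anywhere.

```python
import io

try:
    from lxml import etree  # the library this post is about
except ImportError:
    # Assumption: fall back to the stdlib so the sketch runs without lxml.
    import xml.etree.ElementTree as etree

# Hypothetical document: one meta element, then many similar data elements.
SAMPLE = b"""<feed>
  <meta source="example" batch="42"/>
  <data id="1" value="a"/>
  <data id="2" value="b" note="optional"/>
</feed>"""


def find_data(document):
    """Return all <data> elements, via xpath() when lxml provides it."""
    if hasattr(document, "xpath"):
        return document.xpath("//data")
    return document.findall(".//data")  # stdlib fallback


def extract_rows(xml_bytes):
    """Parse the document and return (meta attributes, list of row dicts)."""
    document = etree.parse(io.BytesIO(xml_bytes))
    meta = dict(document.getroot().find("meta").attrib)
    rows = [dict(el.attrib) for el in find_data(document)]
    return meta, rows


def to_load_lines(rows, columns):
    """Render rows as tab-separated lines; missing optional values are empty."""
    return ["\t".join(row.get(col, "") for col in columns) for row in rows]


if __name__ == "__main__":
    meta, rows = extract_rows(SAMPLE)
    for line in to_load_lines(rows, ["id", "value", "note"]):
        print(line)
```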
Hints on how to proceed - setting things up so I can use gdb on the
lxml sources, for instance - would be greatly appreciated. If this
looks like a bug that has already been fixed in a newer version of one
or more of these libraries, that would be great information (i.e. I
can use it to get all the libraries updated). Anything else that you
think I oughta know would be nice as well.
The sample document is almost half a megabyte (and might be
proprietary). If you'd like to look at it, drop me a line.
thanks,
<mike
--
Mike Meyer <mwm at mired.org> http://www.mired.org/consulting.html
Independent Network/Unix/Perforce consultant, email for more information.