[lxml-dev] Dealing with segfaults in lxml?

Stefan Behnel stefan_ml at behnel.de
Mon Oct 8 21:58:09 CEST 2007


Hi,

sorry for the late reply, I was on vacation last week and am just catching up
with my e-mail.


Mike Meyer wrote:
> I'm getting crashes - by which I mean the python process is
> segfaulting and, with some tweaking of GNU/Linux, leaving me a core
> file - while using lxml to parse data.
> 
> Versions:
> 
> OS: RHEL 5
> Python: 2.5.1 (custom built).
> lxml: 1.3.3
> libxml: 2.6.26 (both compiled and built)
> libxslt: 1.1.17
> 
> Yes, I know those are a bit out of date

They should work, though.


> Rebuilding python with OPTS=-g (I set that for the lxml build as
> well), I can get a "where" output that points at lxml:
> 
> 
> #0  0x00002aaaaf906c3a in rename ()
>    from /usr/local/lib/python2.5/site-packages/lxml/etree.so
> #1  0x00002aaaaf906be7 in rename ()
>    from /usr/local/lib/python2.5/site-packages/lxml/etree.so
> #2  0x00002aaaaf8ebdfe in rename ()
>    from /usr/local/lib/python2.5/site-packages/lxml/etree.so
> #3  0x00002aaaaf966a5c in findOrBuildNodeNs ()
>    from /usr/local/lib/python2.5/site-packages/lxml/etree.so
>
> The first problem is that this isn't repeatable. I've got test data
> that will make it happen, but I have to feed that data through the
> system a few thousand times. This is part of a database ETL system,
> parsing data from the XML to load into the database. If I feed it the
> exact same data over and over again, it'll work 9999 times out of ten
> thousand - but then fail that ten-thousandth time with a segfault.

Are those the real numbers? The 10000, I mean? That would explain a *lot*.

lxml.etree currently has a hard limit for namespace prefix generation (the
"nsXX" bit), which happens to be (an arbitrary) 10000 *per document*.
Admittedly, the resulting behaviour is far from robust and you seem to have
triggered a case where this number matters. I attached a patch (against the
trunk) that switches the counter to a Python long instead, which is only
bounded by available memory.
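
To illustrate the idea only: the following is a minimal pure-Python sketch of
"nsXX" prefix generation driven by an unbounded counter. It is not the
attached patch and not lxml's actual code; the function name new_prefix and
the set of used prefixes are assumptions made up for this example.

```python
import itertools


def new_prefix(used, counter):
    """Return the next 'nsN' prefix that is not already in `used`.

    `used` is a set of prefixes already declared (hypothetical bookkeeping),
    `counter` is an iterator of integers. itertools.count() never stops and
    Python integers do not overflow, so there is no fixed upper limit.
    """
    for n in counter:
        prefix = "ns%d" % n
        if prefix not in used:
            used.add(prefix)
            return prefix


if __name__ == "__main__":
    used = {"ns1"}               # pretend "ns1" is already taken
    counter = itertools.count()  # one counter per document
    print(new_prefix(used, counter))
    print(new_prefix(used, counter))
```

Keeping one counter per document mirrors the "per document" wording above;
how the real implementation stores it is not shown here.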


> The document is straightforward: it starts with a meta element with a
> set of attributes, and then has a lot of data elements, all the same
> type, all with the same attributes (give or take an optional one), and
> I just use document.xpath to find the elements, and then read off
> their attribute values to save to a database load file.
>
> Hints on how to proceed - setting things up so I can use gdb on the
> lxml sources, for instance - would be greatly appreciated.

A way to work around this is to *not* reuse documents. You mention a "meta
element", so I guess you use a single document and keep adding namespaced
elements to it. That lets the counter overflow, as the namespaces must declare
and adapt their prefixes when being added to an existing document. You can
print the "prefix" attribute of elements to see how the numbers go up. I don't
know your code, so I can't be more specific. Please ask back if you need any
further hints on what you can do to avoid this in general.
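
A small sketch of how to watch the generated prefixes, assuming elements are
created with namespaced tags and no explicit prefix mapping. The namespace
URIs and the helper name generated_prefixes are made up for this example, and
the exact prefixes you see may differ between lxml versions.

```python
from lxml import etree

# Hypothetical namespace URIs, one per element, used only for illustration.
NS_TEMPLATE = "http://example.com/ns/%d"


def generated_prefixes(count):
    """Add `count` elements, each in a new namespace, to ONE document
    and collect the prefix lxml assigned to each of them."""
    root = etree.Element("meta")  # a fresh document for every call
    prefixes = []
    for i in range(count):
        tag = "{%s}data" % (NS_TEMPLATE % i)
        child = etree.SubElement(root, tag)
        prefixes.append(child.prefix)
    return prefixes


if __name__ == "__main__":
    # The numbers should climb while the same document keeps growing.
    print(generated_prefixes(5))
```

Since the limit is per document, building a fresh document for each batch of
input (as each call to the helper does) should keep the numbers small.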

Stefan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: python-namespace-prefix-counter.patch
Type: text/x-diff
Size: 2622 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20071008/babff101/attachment-0001.bin 