[lxml-dev] (no subject)

mharper3 at uiuc.edu mharper3 at uiuc.edu
Sun May 4 03:17:49 CEST 2008


Hi lxml-dev:

I'm getting glibc aborts, MemoryErrors, segfaults and a cStringIO import failure from the following minimal reproduction:

<code>
import lxml.etree

wiki_xml_filename = 'enwiki-latest-pages-articles.xml' # from http://download.wikimedia.org/enwiki/latest/
context = lxml.etree.iterparse(wiki_xml_filename, events=("end",))  # one-element tuple; ("end") is just the string "end"
for action, elem in context:
    pass
</code>

The crash usually occurs about halfway through the file (around <page> 3,000,000). The same code runs without error on smaller MediaWiki XML files (200 MB); I only get the error for this very large file (about 13 GB uncompressed). I had no trouble parsing the same file with the Python standard library SAX parser, but it is much slower and I don't like its API.

I'm using libxml2 2.6.32 (earlier versions fail too), Python 2.5.2, python-lxml 2.0.5 (also tried earlier versions), and Kubuntu 8.04 with a 2.6.24 kernel (also tested on openSUSE 10.3 with an earlier kernel).

Some of the exceptions are MemoryErrors. The machine running the code has 4 GB of RAM. The kernel does not appear to hit swap significantly during the run.
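
For comparison, here is a variant of the reproduction that releases each element once it has been seen. This is a minimal sketch, not a confirmed fix for the crashes below: it assumes the memory growth comes from iterparse keeping the tree it has built so far, which the loop above never discards. The function name count_pages, the namespace-stripping check and the tiny in-memory sample are illustrative only, and the fallback import is there just so the sketch also runs where lxml is not installed.

```python
import io

try:
    from lxml import etree  # the library used in this report
except ImportError:
    # Assumption: the stdlib module is an acceptable stand-in for the sketch;
    # it offers the same iterparse() and Element.clear() calls.
    import xml.etree.ElementTree as etree


def count_pages(source, local_name="page"):
    """Stream through `source`, count elements named `local_name`,
    and empty each one as soon as its "end" event has been handled."""
    count = 0
    for event, elem in etree.iterparse(source, events=("end",)):
        # MediaWiki dumps are namespaced, so compare only the local part
        # of a tag such as "{namespace-uri}page".
        if elem.tag.rsplit("}", 1)[-1] == local_name:
            count += 1
            # Drop the element's children, text and attributes. The empty
            # element itself stays attached to its parent.
            elem.clear()
    return count


# Tiny stand-in for the real dump, which would be passed as a filename.
sample = io.BytesIO(
    b"<mediawiki>"
    b"<page><title>A</title></page>"
    b"<page><title>B</title></page>"
    b"</mediawiki>"
)
page_count = count_pages(sample)
```

Clearing only the finished element leaves one empty <page> shell per article attached to the root, so memory use would still grow slowly over millions of pages; whether that residue matters for a 13 GB file is untested here.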

Here are the errors:

*** glibc detected *** python: free(): invalid pointer: 0x08220a15 ***
Aborted

Also:

Traceback (most recent call last):
  File "minimal.py", line 6, in <module>
    for action, elem in context:
  File "iterparse.pxi", line 390, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:65064)
  File "parser.pxi", line 489, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:47432)
lxml.etree.XMLSyntaxError: None
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/apport_python_hook.py", line 37, in apport_excepthook
    import re, tempfile, traceback
  File "/usr/lib/python2.5/traceback.py", line 241, in <module>
    def print_last(limit=None, file=None):
MemoryError

Original exception was:
Traceback (most recent call last):
  File "minimal.py", line 6, in <module>
    for action, elem in context:
  File "iterparse.pxi", line 390, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:65064)
  File "parser.pxi", line 489, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:47432)
lxml.etree.XMLSyntaxError: None

... and also (slightly different, the MemoryError is raised while importing tempfile):

Traceback (most recent call last):
  File "minimal.py", line 6, in <module>
    for action, elem in context:
  File "iterparse.pxi", line 390, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:65064)
  File "parser.pxi", line 489, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:47432)
lxml.etree.XMLSyntaxError: None
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/apport_python_hook.py", line 37, in apport_excepthook
    import re, tempfile, traceback
  File "/usr/lib/python2.5/tempfile.py", line 33, in <module>
    from random import Random as _Random
MemoryError

Original exception was:
Traceback (most recent call last):
  File "minimal.py", line 6, in <module>
    for action, elem in context:
  File "iterparse.pxi", line 390, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:65064)
  File "parser.pxi", line 489, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:47432)
lxml.etree.XMLSyntaxError: None

Sometimes I just get 'Segmentation fault' from the shell, and sometimes the process hangs indefinitely.

And finally (cStringIO):

Traceback (most recent call last):
  File "minimal.py", line 6, in <module>
    for action, elem in context:
  File "iterparse.pxi", line 390, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:65064)
  File "parser.pxi", line 489, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:47432)
lxml.etree.XMLSyntaxError: None
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/apport_python_hook.py", line 36, in apport_excepthook
    from cStringIO import StringIO
ImportError: /usr/lib/python2.5/lib-dynload/cStringIO.so: failed to map segment from shared object: Permission denied

Original exception was:
Traceback (most recent call last):
  File "minimal.py", line 6, in <module>
    for action, elem in context:
  File "iterparse.pxi", line 390, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:65064)
  File "parser.pxi", line 489, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:47432)
lxml.etree.XMLSyntaxError: None

Any direction on tracking down the source is greatly appreciated!

-- Marc


