[lxml-dev] (no subject)

Stefan Behnel stefan_ml at behnel.de
Sun May 4 07:34:50 CEST 2008


Hi,

mharper3 at uiuc.edu wrote:
> I'm getting glibc/MemoryError/cStringIO crashes/exceptions from the following (minimal reproduction) code:
> 
> <code>
> import lxml.etree
> 
> wiki_xml_filename = 'enwiki-latest-pages-articles.xml' # from http://download.wikimedia.org/enwiki/latest/
> context = lxml.etree.iterparse(wiki_xml_filename, events=("end"))
> for action, elem in context:
>     pass
> </code>
> 
> The crash usually occurs about halfway through the file (around <page>
> 3,000,000) The same code runs on smaller mediawiki xml files (200 mb)
> without error. I only get this error for this very large xml file (in this
> case about 13gb uncompressed). I had no trouble parsing the same file with
> the python standard library sax parser, but it is much slower and I don't
> like its api.
>
> Some of the exceptions are MemoryErrors. The machine running the code has
> 4gb of ram. The kernel does not appear to significantly hit the swap during
> the run.

iterparse() builds a tree in memory, so parsing a 13 GB file on a machine with
4 GB of RAM will fail, *unless* you clean up the parts of the tree that you no
longer need.

Something like

   for action, elem in context:
       if elem.tag == "page":
           # handle page
           elem.clear()
       elif elem.tag in tag_names_of_ancestors_of_page_elements:
           elem.clear()

might work for you.
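
To make the loop above concrete, here is a minimal, self-contained sketch that runs on a tiny in-memory document instead of the real dump. The function name count_pages, the page_tag parameter and the sample data are made up for illustration. It also assumes the elements are not namespaced; if the dump declares a default namespace, elem.tag will look like "{namespace-uri}page" and the comparison has to use that form. Note that events is written as ("end",) with a trailing comma, since ("end") is just a string. The fallback import is only there so the sketch still runs where lxml is not installed.

```python
import io

try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib ElementTree provides the same iterparse()
    # and Element.clear() calls that this sketch relies on.
    import xml.etree.ElementTree as etree


def count_pages(source, page_tag="page"):
    """Stream over source, count <page> elements and free each one after use."""
    pages = 0
    for event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == page_tag:
            pages += 1      # handle the page here
            elem.clear()    # drop its text, attributes and children
    return pages


# Hypothetical stand-in for the real dump file.
sample = io.BytesIO(
    b"<mediawiki>"
    b"<page><title>A</title></page>"
    b"<page><title>B</title></page>"
    b"</mediawiki>"
)
print(count_pages(sample))
```

The cleared page elements themselves stay attached to their parent as empty shells, so memory can still grow slowly on a very large file; that is what the second branch of the loop above is meant to address.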

BTW, you can also parse the gzip-compressed file directly; that might even be faster.
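
A rough sketch of reading a compressed file, under a few assumptions: instead of handing the .gz filename to the parser, it opens the file explicitly with the standard gzip module and passes the resulting stream to iterparse(), which is a slightly different route to the same result. The names count_pages_gz and write_sample and the sample file are hypothetical, and nothing here measures whether this is actually faster.

```python
import gzip
import os
import tempfile

try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib ElementTree accepts the same file-like input.
    import xml.etree.ElementTree as etree


def count_pages_gz(path, page_tag="page"):
    """Count <page> elements in a gzip-compressed XML file, clearing as we go."""
    pages = 0
    with gzip.open(path, "rb") as stream:   # decompresses on the fly
        for event, elem in etree.iterparse(stream, events=("end",)):
            if elem.tag == page_tag:
                pages += 1
                elem.clear()
    return pages


def write_sample(path):
    """Write a tiny compressed stand-in for the real dump (illustration only)."""
    with gzip.open(path, "wb") as out:
        out.write(b"<mediawiki><page/><page/><page/></mediawiki>")


with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.xml.gz")
    write_sample(sample_path)
    print(count_pages_gz(sample_path))
```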

Stefan

