[lxml-dev] Trouble parsing large XML document with ElementTree
Stefan Behnel
stefan_ml at behnel.de
Thu May 22 09:39:14 CEST 2008
Hi,
Sam Kuper wrote:
> Dear lovely lxmlves,
> Yesterday I tried to parse a large file, the Open Directory Project's links
> document, available here <http://rdf.dmoz.org/rdf/content.rdf.u8.gz>. The
> process went like this:
>
> 1) Unzipped the file using 7-zip. No errors reported.
> 2) Renamed the file by adding a .xml extension, mainly so Windows (see my
> spec below) would recognise it as an XML file.
> 3) Had a look at the file in Oxygen's large document viewer. It took a few
> minutes to load, but everything looked shipshape.
> 4) Opened a command prompt, navigated to the directory containing the file,
> and started Python.
> 5) Entered: from lxml import etree
> 6) Entered: doc = open ('content.rdf.u8.xml', 'r')
> 7) Entered: docParsed = etree.parse(doc)
lxml can parse directly from a gzipped XML file, so there is no need for
steps 1) and 6). Just do

    docParsed = etree.parse('content.rdf.u8.xml.gz')

or even

    docParsed = etree.parse('http://rdf.dmoz.org/rdf/content.rdf.u8.gz')

BTW, if you do 6), it should read

    doc = open('content.rdf.u8.xml', 'rb')

Mind the 'rb' at the end.
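
For anyone who wants to try the "skip the unzip step" idea on something
smaller than the full dump, here is a minimal sketch. It opens the compressed
file explicitly with the standard gzip module in binary mode and hands the
stream to etree.parse(), which is a swapped-in variant of the filename form
shown above rather than the same call. The sample document, the file name
sample.rdf.u8.gz and the helper parse_gzipped() are made up for illustration,
and the fallback to the standard library's ElementTree is only an assumption
so the snippet still runs where lxml is not installed.

```python
import gzip
import os
import tempfile

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: fall back to the stdlib API, which also offers parse().
    import xml.etree.ElementTree as etree

# Hypothetical stand-in for the real (much larger) dump.
SAMPLE = b'<?xml version="1.0" encoding="UTF-8"?><RDF><Topic id="Top"/></RDF>'


def parse_gzipped(path):
    """Parse a gzip-compressed XML file without unpacking it to disk first."""
    # 'rb' matters: the parser wants raw bytes so that it can apply the
    # encoding declared in the XML prolog itself.
    with gzip.open(path, 'rb') as stream:
        return etree.parse(stream)


# Write a tiny compressed sample file to a temporary directory.
tmpdir = tempfile.mkdtemp()
sample_path = os.path.join(tmpdir, 'sample.rdf.u8.gz')
with gzip.open(sample_path, 'wb') as out:
    out.write(SAMPLE)

tree = parse_gzipped(sample_path)
print(tree.getroot().tag)
```

Note that this still builds the whole tree in memory, so it saves disk space
and a manual step, not RAM.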
> Steps 4, 5 and 6 all went smoothly, but after step 7, the RAM usage went up
> to around 96% (fair enough, it's a big document) and the Windows UI became
> sluggish. It didn't crash, and the RAM usage stabilised around that amount,
> with Windows Task Manager showing well under 10% CPU load from Python.
That means your machine was heavily swapping. The in-memory tree of libxml2 is
much larger than the serialised document itself, so if it doesn't fit into
RAM, parsing the tree into memory will not make you happy, especially not with
64/128MB...
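
The message above does not suggest a workaround, but one common way to avoid
holding the complete tree in memory is incremental parsing with iterparse(),
clearing each element once it has been handled. The following is only a
sketch under that assumption: the sample data, the helper count_elements()
and the un-namespaced tag name 'Topic' are hypothetical (a real RDF dump would
normally use namespaced tags), and the stdlib fallback import is there only so
the snippet runs without lxml.

```python
import io

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: the stdlib ElementTree offers a compatible iterparse().
    import xml.etree.ElementTree as etree

# Hypothetical miniature document standing in for the real dump.
SAMPLE = b'<RDF><Topic id="a"/><Topic id="b"/><Topic id="c"/></RDF>'


def count_elements(stream, tag):
    """Count elements named `tag` while parsing the stream incrementally."""
    count = 0
    # An 'end' event fires once an element has been read completely.
    for _event, elem in etree.iterparse(stream, events=('end',)):
        if elem.tag == tag:
            count += 1
        # Drop the element's text, attributes and children once handled.
        # Empty element shells remain attached to their parent, so memory
        # use is much reduced but not strictly constant.
        elem.clear()
    return count


topic_count = count_elements(io.BytesIO(SAMPLE), 'Topic')
print(topic_count)
```

The same function accepts any binary file-like object, for example one
opened with open(..., 'rb').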
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "lxml.etree.pyx", line 2520, in lxml.etree.parse
> File "parser.pxi", line 1331, in lxml.etree._parseDocument
> File "parser.pxi", line 1361, in lxml.etree._parseFilelikeDocument
> File "parser.pxi", line 1254, in lxml.etree._parseDocFromFilelike
> File "parser.pxi", line 850, in
> lxml.etree._BaseParser._parseDocFromFilelike
> File "parser.pxi", line 452, in
> lxml.etree._ParserContext._handleParseResultDoc
> File "parser.pxi", line 536, in lxml.etree._handleParseResult
> File "parser.pxi", line 478, in lxml.etree._raiseParseError
> lxml.etree.XMLSyntaxError: Memory allocation failed : building node
Your operating system stopped allowing the process to allocate more memory,
and it didn't even crash; it just gave you an exception. Isn't that cool? :)
(Although I wouldn't generally rely on that ...)
Stefan