[lxml-dev] Trouble parsing large XML document with ElementTree
Sam Kuper
sam.kuper at uclmail.net
Thu May 22 01:52:08 CEST 2008
Dear lovely lxmlves,
Yesterday I tried to parse a large file, the Open Directory Project's links
document, available here <http://rdf.dmoz.org/rdf/content.rdf.u8.gz>. The
process went like this:
1) Unzipped the file using 7-zip. No errors reported.
2) Renamed the file by adding a .xml extension, mainly so Windows (see my
spec below) would recognise it as an XML file.
3) Had a look at the file in Oxygen's large document viewer. It took a few
minutes to load, but everything looked shipshape.
4) Opened a command prompt, navigated to the directory containing the file,
and started Python.
5) Entered: from lxml import etree
6) Entered: doc = open ('content.rdf.u8.xml', 'r')
7) Entered: docParsed = etree.parse(doc)
Steps 4, 5 and 6 all went smoothly, but after step 7, the RAM usage went up
to around 96% (fair enough, it's a big document) and the Windows UI became
sluggish. It didn't crash, and the RAM usage stabilised around that amount,
with Windows Task Manager showing well under 10% CPU load from Python.
Still, I figured it might take a while to parse, so I left it overnight.
In the morning, I found the following error message immediately underneath
the command I'd entered in step 7:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 2520, in lxml.etree.parse
  File "parser.pxi", line 1331, in lxml.etree._parseDocument
  File "parser.pxi", line 1361, in lxml.etree._parseFilelikeDocument
  File "parser.pxi", line 1254, in lxml.etree._parseDocFromFilelike
  File "parser.pxi", line 850, in lxml.etree._BaseParser._parseDocFromFilelike
  File "parser.pxi", line 452, in lxml.etree._ParserContext._handleParseResultDoc
  File "parser.pxi", line 536, in lxml.etree._handleParseResult
  File "parser.pxi", line 478, in lxml.etree._raiseParseError
lxml.etree.XMLSyntaxError: Memory allocation failed : building node
I hope that's meaningful to someone, and that perhaps I might be able to get
some suggestions about how to parse the file on my PC.
Also, I was thinking of trying to parse the file on a virtual server that
only has 64M of RAM. I don't mind if the VPS takes a day or two, as long as
the code to make it work is fairly straightforward. So any suggestions about
that option would be helpful too.
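
One approach that may fit both cases is incremental parsing with
etree.iterparse, which hands back elements as they are completed instead of
building the whole tree first. Below is a minimal sketch, not tested against
the full ODP dump. The function name count_elements and the choice of simply
counting one tag are illustrative assumptions, and it is written for a
current Python 3 rather than the Python 2.5 listed in my spec.

```python
try:
    from lxml import etree  # preferred; matches the setup described here
except ImportError:
    # Assumption: the stdlib module is an acceptable stand-in, since it
    # offers the same iterparse() call used below.
    import xml.etree.ElementTree as etree


def count_elements(source, tag):
    """Count elements named `tag` without keeping the whole tree in memory.

    `source` is a filename or a binary file-like object.
    (Hypothetical helper, for illustration only.)
    """
    count = 0
    # "end" events fire once an element and all of its children are parsed.
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag != tag:
            continue
        count += 1  # replace with the real per-element processing
        elem.clear()  # drop this element's text, attributes and children
        # lxml only: also detach the siblings that came before this element,
        # so that emptied elements do not pile up under the parent.
        parent = elem.getparent() if hasattr(elem, "getparent") else None
        if parent is not None:
            while elem.getprevious() is not None:
                del parent[0]
    return count
```

Since it accepts a binary file object as well as a filename, passing
gzip.open('content.rdf.u8.gz', 'rb') should avoid having to unpack the file
first. In a namespaced document the tag has to be given in the form
'{namespace-uri}localname'.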
Many thanks,
Sam
---
MacBook 2.13GHz with 2GB RAM
Windows Vista Home Premium via Leopard BootCamp
ActivePython 2.5.1
lxml installed via lxml-2.0.3-py2.5-win32.egg (this was the most up-to-date
egg that was available last time I checked, which was about a week or two
ago)