Dear lovely lxmlves,<div><br></div><div>Yesterday I tried to parse a large file, the Open Directory Project's links document, available <a href="http://rdf.dmoz.org/rdf/content.rdf.u8.gz">here</a>. The process went like this:</div>
<div><br></div><div>1) Unzipped the file using 7-zip. No errors reported.</div><div>2) Renamed the file by adding a .xml extension, mainly so Windows (see my spec below) would recognise it as an XML file.</div><div>3) Had a look at the file in Oxygen's large document viewer. It took a few minutes to load, but everything looked shipshape.</div>
<div>4) Opened a command prompt, navigated to the directory containing the file, and started Python.</div><div>5) Entered: from lxml import etree</div><div>6) Entered: doc = open ('content.rdf.u8.xml', 'r')</div>
<div>7) Entered: docParsed = etree.parse(doc)</div><div><br></div><div>Steps 4, 5 and 6 all went smoothly, but after step 7, the RAM usage went up to around 96% (fair enough, it's a big document) and the Windows UI became sluggish. It didn't crash, and the RAM usage stabilised around that amount, with Windows Task Manager showing well under 10% CPU load from Python. Still, I figured it might take a while to parse, so I left it overnight.</div>
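<div><br></div><div>For reference, steps 5 to 7 can be written as a short script. This is only a sketch: the helper name parse_whole_file, the tiny stand-in document, and the fallback to the standard library's ElementTree when lxml isn't installed are assumptions for illustration, not what was actually typed at the prompt. It does show the key point, which is that etree.parse builds the whole tree in memory before returning.</div>

```python
import os
import tempfile

try:
    from lxml import etree  # the library used in the session above
except ImportError:
    # Assumption: fall back to the standard library so the sketch runs without lxml.
    import xml.etree.ElementTree as etree


def parse_whole_file(path):
    """Parse the entire document at `path` into one in-memory tree (steps 6 and 7)."""
    # Binary mode leaves encoding detection to the XML parser.
    with open(path, 'rb') as doc:
        return etree.parse(doc)


# Build a tiny stand-in document (hypothetical content); the real input was
# content.rdf.u8.xml, which is far too large to load this way on 2GB of RAM.
root = etree.Element("RDF")
topic = etree.SubElement(root, "Topic")
etree.SubElement(topic, "link")

fd, sample_path = tempfile.mkstemp(suffix=".xml")
os.close(fd)
etree.ElementTree(root).write(sample_path)

doc_parsed = parse_whole_file(sample_path)
os.remove(sample_path)

print(doc_parsed.getroot().tag)
```

<div>Because every node of the document is held at once, memory use grows with the size of the file, which would fit with the RAM usage seen after step 7.</div>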
<div><br></div><div>In the morning, I found the following error message immediately underneath the command I'd entered in step 7:</div><div><br></div><div><div>Traceback (most recent call last):</div><div> File "&lt;stdin&gt;", line 1, in &lt;module&gt;</div>
<div> File "lxml.etree.pyx", line 2520, in lxml.etree.parse</div><div> File "parser.pxi", line 1331, in lxml.etree._parseDocument</div><div> File "parser.pxi", line 1361, in lxml.etree._parseFilelikeDocument</div>
<div> File "parser.pxi", line 1254, in lxml.etree._parseDocFromFilelike</div><div> File "parser.pxi", line 850, in lxml.etree._BaseParser._parseDocFromFilelike</div><div> File "parser.pxi", line 452, in lxml.etree._ParserContext._handleParseResultDoc</div>
<div> File "parser.pxi", line 536, in lxml.etree._handleParseResult</div><div> File "parser.pxi", line 478, in lxml.etree._raiseParseError</div><div>lxml.etree.XMLSyntaxError: Memory allocation failed : building node</div>
<div><br></div><div>I hope that's meaningful to someone, and that I might be able to get some suggestions about how to parse the file on my PC.</div><div><br></div><div>Also, I was thinking of trying to parse the file on a virtual server that has only 64 MB of RAM. I don't mind if the VPS takes a day or two, as long as the code to make it work is fairly straightforward. So any suggestions about that option would be helpful too.</div>
<div><br></div><div>Many thanks,</div><div><br></div><div>Sam</div><div>---</div><div>MacBook 2.13GHz with 2GB RAM<br></div><div>Windows Vista Home Premium via Leopard Boot Camp</div><div>ActivePython 2.5.1</div><div>lxml installed via lxml-2.0.3-py2.5-win32.egg (this was the most up-to-date egg available when I last checked, a week or two ago)<br>
</div></div>