[lxml-dev] Trouble parsing large XML document with ElementTree
Sam Kuper
sam.kuper at uclmail.net
Sat May 24 15:25:52 CEST 2008
Dear Stefan,
I've tried the method you've suggested below, but it isn't quite working for
me. It may be that I've misunderstood your suggestion. I'll explain what
I've tried. Here is my python program, extract_links_dmoz.py:
from lxml import etree

infile = open("content.example.xml", "r")
infile.seek(0)
outfile = open("output_test001.txt", "w")

class EchoTarget():
    def start(self, tag, attrib):
        if tag.endswith("xternalPage"):
            line = attrib["about"]
            if line != "":
                outfile.write(line+"\n")
                print line
    def close(self):
        return "closed!"

parser = etree.XMLParser(target = EchoTarget())
result = etree.XML(infile.read(), parser)
This uses the short, example RDF file at
http://rdf.dmoz.org/rdf/content.example.txt (which I have renamed
content.example.xml), and works fine. When I view the output_test001.txt
file, it contains one URL per line, which is exactly what I want for now.
However, if I change the program to read content.rdf.u8.xml (i.e. the
full-length DMOZ links file from http://rdf.dmoz.org/rdf/content.rdf.u8.gz)
instead of content.example.xml, then when I run the program I get the
following error:
Traceback (most recent call last):
  File "extract_links_dmoz.py", line 26, in <module>
    result = etree.XML(infile.read(), parser)
MemoryError
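
The traceback points at infile.read(), which loads the whole uncompressed
file into a single string before the parser sees any of it. What follows is
only a minimal sketch of one possible way around that, assuming the parser's
feed()/close() methods can be combined with a target object. The names
UrlWriter and feed_in_chunks, the chunk size, and the sample data are
hypothetical and chosen for illustration; the fallback import exists only so
the sketch also runs where lxml is not installed. It is written for Python 3,
unlike the script in this message.

```python
import io

try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library's XMLParser accepts the same kind of
    # target object and offers the same feed()/close() methods.
    import xml.etree.ElementTree as etree


class UrlWriter(object):
    """Parser target that writes each ExternalPage 'about' URL on its own line."""

    def __init__(self, out):
        self.out = out
        self.count = 0

    def start(self, tag, attrib):
        if tag.endswith("ExternalPage"):
            url = attrib.get("about", "")
            if url:
                self.out.write(url + "\n")
                self.count += 1

    def close(self):
        # Whatever close() returns becomes the result of parser.close().
        return self.count


def feed_in_chunks(stream, target, chunk_size=64 * 1024):
    """Feed a binary stream to a target parser one chunk at a time."""
    parser = etree.XMLParser(target=target)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    return parser.close()


# Hypothetical stand-in for the real file, small enough to read at a glance.
SAMPLE = (
    b'<RDF><Topic id="Top"/>'
    b'<ExternalPage about="http://example.org/a"/>'
    b'<ExternalPage about="http://example.org/b"/>'
    b'</RDF>'
)

out = io.StringIO()
count = feed_in_chunks(io.BytesIO(SAMPLE), UrlWriter(out), chunk_size=16)
```

For the real file, the stream would be opened in binary mode, for example
open("content.rdf.u8.xml", "rb"), so that only one chunk is held in memory
at a time.
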
Any help you (or others) can offer would be greatly appreciated.
Many thanks,
Sam
2008/5/22 Stefan Behnel <stefan_ml at behnel.de>:
> Hi,
>
> Sam Kuper wrote:
> > Gosh, this is turning into a really fragmented post; apologies. I meant to
> > add to the first post that once parsed, my intention was to run a fairly
> > simple XSL transform on the document, to extract a copy of each of the URLs
> > it contains. Probably something like this:
> > <?xml version="1.0" encoding="UTF-8"?>
> > <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
> >   <xsl:template match="/">
> >     <html>
> >       <body>
> >         <h2>ODP URLs</h2>
> >         <xsl:for-each select="Topic/link">
> >           <p><xsl:value-of select="@r:resource"/></p>
> >         </xsl:for-each>
> >       </body>
> >     </html>
> >   </xsl:template>
> > </xsl:stylesheet>
>
> That is a problem that can be solved with extremely little memory. Take a look
> at the (SAX-like) target parser interface, which does not build a tree and
> instead just receives callbacks while parsing:
>
> http://codespeak.net/lxml/dev/parsing.html#the-target-parser-interface
>
> Write a parser target class that keeps track of being inside or outside the
> "Topic" tag (start/end), and whenever you find a "link" tag while inside a
> "Topic" tag, look for a "{whatever-namespace}resource" attribute in the attrib
> dictionary and write it into a hand-generated HTML stream like the one you
> used above.
>
> Stefan
>
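
The target class described in the quoted reply could be sketched roughly as
follows. This is an illustration under stated assumptions, not Stefan's own
code: the class name TopicLinkTarget, the local_name helper, and the
namespace URIs in the sample are hypothetical, and attribute names are
matched by local name only because the reply leaves the namespace open. The
fallback import exists only so the sketch also runs where lxml is not
installed. It is written for Python 3.

```python
import io
from xml.sax.saxutils import escape

try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library's XMLParser accepts the same kind of
    # target object as lxml's.
    import xml.etree.ElementTree as etree


def local_name(name):
    """Strip a leading '{namespace}' part from a tag or attribute name."""
    return name.rsplit("}", 1)[-1]


class TopicLinkTarget(object):
    """Writes the 'resource' attribute of every link found inside a Topic."""

    def __init__(self, out):
        self.out = out
        self.topic_depth = 0  # greater than zero while inside a Topic element
        self.out.write("<html>\n<body>\n<h2>ODP URLs</h2>\n")

    def start(self, tag, attrib):
        name = local_name(tag)
        if name == "Topic":
            self.topic_depth += 1
        elif name == "link" and self.topic_depth > 0:
            for key, value in attrib.items():
                if local_name(key) == "resource":
                    # Escape &, < and > so the generated HTML stays valid.
                    self.out.write("<p>%s</p>\n" % escape(value))

    def end(self, tag):
        if local_name(tag) == "Topic":
            self.topic_depth -= 1

    def close(self):
        self.out.write("</body>\n</html>\n")
        return self.out


# Hypothetical sample; the namespace URIs are placeholders, not the real ones.
SAMPLE = (
    b'<RDF xmlns="http://example.org/dmoz" xmlns:r="http://example.org/rdf">'
    b'<Topic r:id="Top/Arts">'
    b'<link r:resource="http://example.org/?a=1&amp;b=2"/>'
    b'</Topic>'
    b'<link r:resource="http://example.org/outside"/>'
    b'</RDF>'
)

parser = etree.XMLParser(target=TopicLinkTarget(io.StringIO()))
html = etree.XML(SAMPLE, parser).getvalue()
```

Because the target only counts Topic nesting and writes each URL as soon as
it is seen, its memory use should not grow with the size of the input,
provided the parser itself is given the data incrementally rather than as one
string.
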