[lxml-dev] Trouble parsing large XML document with ElementTree
Sam Kuper
sam.kuper at uclmail.net
Sat May 24 16:19:32 CEST 2008
To add to the message below, I've just tried running a much simpler program
that doesn't call lxml, to see whether the memory error is a Python/environment
problem rather than being due to lxml. It turns out that it is:
>>> infile = open("content.rdf.u8.xml", "r")
>>> print infile.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
MemoryError
Ok, so clearly Python isn't happy to read content.rdf.u8.xml in one go. The
normal workaround for processing large text files piece by piece seems to be
either to set a byte limit on how much is read at once, or to read the file
line by line. However, neither of those will work in this case because they
won't produce well-formed XML that the target parser interface can handle
(correct me if I'm wrong).
I'm sure there must be a fairly easy solution to this, but it's eluding me.
All assistance greatly appreciated!
Sam
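
On the well-formedness concern above: one possible way around it, sketched
under some assumptions, is the parser's feed interface. lxml parsers (like
those in the standard library's xml.etree.ElementTree) accept a document in
arbitrary chunks via feed() and only need the complete input to be well-formed,
not each individual chunk. The names LinkTarget, extract_links and chunk_size,
as well as the sample document and its namespace URI, are made up for
illustration. The code is written for Python 3 and falls back to
xml.etree.ElementTree if lxml is not installed.

```python
import io

try:
    from lxml import etree
except ImportError:  # fallback: the stdlib parser offers the same interface
    import xml.etree.ElementTree as etree


class LinkTarget(object):
    """Parser target that collects the "about" attribute of ExternalPage tags."""

    def __init__(self, sink):
        self.sink = sink  # hypothetical: any list-like object to append URLs to

    def start(self, tag, attrib):
        # The tag may carry a "{namespace}" prefix, hence endswith().
        if tag.endswith("ExternalPage"):
            url = attrib.get("about", "")
            if url:
                self.sink.append(url)

    def end(self, tag):
        pass

    def data(self, text):
        pass

    def close(self):
        return self.sink


def extract_links(stream, chunk_size=64 * 1024):
    """Feed a binary stream to the parser chunk by chunk; no tree is built."""
    parser = etree.XMLParser(target=LinkTarget([]))
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)  # chunks may end in the middle of a tag
    return parser.close()   # returns whatever the target's close() returns


# Made-up sample input; the tiny chunk size forces tags to be split.
sample = (
    b'<RDF xmlns="http://example.org/dmoz">'
    b'<ExternalPage about="http://example.org/a"/>'
    b'<ExternalPage about="http://example.org/b"/>'
    b'</RDF>'
)
links = extract_links(io.BytesIO(sample), chunk_size=16)
```

For the full dump, the same function could be called with a file opened as
open("content.rdf.u8.xml", "rb"), so that only one chunk is held in memory at
a time.
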
2008/5/24 Sam Kuper <sam.kuper at uclmail.net>:
> Dear Stefan,
>
> I've tried the method you've suggested below, but it isn't quite working
> for me. It may be that I've misunderstood your suggestion. I'll explain what
> I've tried. Here is my python program, extract_links_dmoz.py:
>
> from lxml import etree
> infile = open("content.example.xml", "r")
> infile.seek(0)
> outfile = open("output_test001.txt", "w")
> class EchoTarget():
>     def start(self, tag, attrib):
>         if tag.endswith("xternalPage"):
>             line = attrib["about"]
>             if line != "":
>                 outfile.write(line+"\n")
>                 print line
>     def close(self):
>         return "closed!"
> parser = etree.XMLParser(target = EchoTarget())
> result = etree.XML(infile.read(), parser)
>
> This uses the short, example RDF file at
> http://rdf.dmoz.org/rdf/content.example.txt (which I have renamed
> content.example.xml), and works fine. When I view the output_test001.txt
> file, it contains one URL per line, which is exactly what I want for now.
>
> However, if I change the program to read content.rdf.u8.xml (i.e. the
> full-length DMOZ links file from http://rdf.dmoz.org/rdf/content.rdf.u8.gz)
> instead of content.example.xml , then when I run the program I get the
> following error:
>
> Traceback (most recent call last):
>   File "extract_links_dmoz.py", line 26, in <module>
>     result = etree.XML(infile.read(), parser)
> MemoryError
>
> Any help you (or others) can offer would be greatly appreciated.
>
> Many thanks,
>
> Sam
>
> 2008/5/22 Stefan Behnel <stefan_ml at behnel.de>:
>
>> Hi,
>>
>> Sam Kuper wrote:
>> > Gosh, this is turning into a really fragmented post; apologies. I meant to
>> > add to the first post that once parsed, my intention was to run a fairly
>> > simple XSL transform on the document, to extract a copy of each of the URLs
>> > it contains. Probably something like this:
>> > <?xml version="1.0" encoding="UTF-8"?>
>> > <xsl:stylesheet version="1.0"
>> >     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
>> >   <xsl:template match="/">
>> >     <html>
>> >       <body>
>> >         <h2>ODP URLs</h2>
>> >         <xsl:for-each select="Topic/link">
>> >           <p><xsl:value-of select="@r:resource"/></p>
>> >         </xsl:for-each>
>> >       </body>
>> >     </html>
>> >   </xsl:template>
>> > </xsl:stylesheet>
>>
>> That is a problem that can be solved with extremely little memory. Take a
>> look at the (SAX-like) target parser interface, which will not build a tree
>> and instead just receive callbacks while parsing:
>>
>> http://codespeak.net/lxml/dev/parsing.html#the-target-parser-interface
>>
>> Write a parser target class that keeps track of being inside or outside the
>> "Topic" tag (start/end), and whenever you find a "link" tag while inside a
>> "Topic" tag, look for a "{whatever-namespace}resource" attribute in the
>> attrib dictionary and write it into a hand-generated HTML stream like the
>> one you used above.
>>
>> Stefan
>>
>
>
>
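
The suggestion quoted above (track whether the parser is inside a "Topic"
element and pick up the "resource" attribute of each "link" found there) could
be sketched roughly as follows. This is only one reading of that advice, not
code from the thread: the names TopicLinkTarget and local_name, the sample
document and both of its namespace URIs are placeholders, and attributes are
matched by local name so that no particular namespace is assumed. It is written
for Python 3 and falls back to xml.etree.ElementTree if lxml is not installed.

```python
import io
from xml.sax.saxutils import escape

try:
    from lxml import etree
except ImportError:  # fallback: the stdlib parser offers the same interface
    import xml.etree.ElementTree as etree


def local_name(qname):
    """Strip a leading "{namespace}" from a tag or attribute name."""
    return qname.rsplit("}", 1)[-1]


class TopicLinkTarget(object):
    """Writes one <p> per link resource found inside a Topic element."""

    def __init__(self, out):
        self.out = out        # any text stream with a write() method
        self.topic_depth = 0  # > 0 while inside a Topic element
        out.write("<html>\n<body>\n<h2>ODP URLs</h2>\n")

    def start(self, tag, attrib):
        name = local_name(tag)
        if name == "Topic":
            self.topic_depth += 1
        elif name == "link" and self.topic_depth > 0:
            for key, value in attrib.items():
                if local_name(key) == "resource":
                    self.out.write("<p>%s</p>\n" % escape(value))

    def end(self, tag):
        if local_name(tag) == "Topic":
            self.topic_depth -= 1

    def data(self, text):
        pass

    def close(self):
        self.out.write("</body>\n</html>\n")
        return self.out


# Made-up sample input with placeholder namespace URIs.
sample = (
    b'<RDF xmlns="http://example.org/dmoz" xmlns:r="http://example.org/rdf">'
    b'<Topic r:id="Top/Arts">'
    b'<link r:resource="http://example.org/a"/>'
    b'</Topic>'
    b'<link r:resource="http://example.org/outside"/>'
    b'</RDF>'
)
parser = etree.XMLParser(target=TopicLinkTarget(io.StringIO()))
parser.feed(sample)
html = parser.close().getvalue()
```

Because the target only keeps a counter and writes to the output stream as it
goes, memory use stays roughly constant however many links the input contains.
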