[lxml-dev] Trouble parsing large XML document with ElementTree

Stefan Behnel stefan_ml at behnel.de
Thu May 22 10:19:37 CEST 2008


Hi,

Sam Kuper wrote:
> Gosh, this is turning into a really fragmented post; apologies. I meant to
> add to the first post that once parsed, my intention was to run a fairly
> simple XSL transform on the document, to extract a copy of each of the URLs
> it contains. Probably something like this:
> <?xml version="1.0" encoding="UTF-8"?>
> <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
>     <xsl:template match="/">
>         <html>
>             <body>
>                 <h2>ODP URLs</h2>
>                 <xsl:for-each select="Topic/link">
>                     <p><xsl:value-of select="@r:resource"/></p>
>                 </xsl:for-each>
>             </body>
>         </html>
>     </xsl:template>
> </xsl:stylesheet>

That problem can be solved with very little memory. Take a look at the
(SAX-like) target parser interface, which does not build a tree and instead
just sends callbacks to your target object while parsing:

  http://codespeak.net/lxml/dev/parsing.html#the-target-parser-interface

Write a parser target class that keeps track of being inside or outside the
"Topic" tag (start/end). Whenever you find a "link" tag while inside a
"Topic" tag, look for a "{whatever-namespace}resource" attribute in the attrib
dictionary and write it into a hand-generated HTML stream like the one you
used above.
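
A rough sketch of such a target class could look like the following. Treat it
as an illustration under a few assumptions rather than a finished solution:
the names LinkCollector, local_name and extract_links are made up for this
example, the "r" prefix is assumed to be bound to the RDF syntax namespace
(check the namespace declarations in your file), and tags are compared by
local name in case the document declares a default namespace. It imports the
standard library's ElementTree so that it runs without extra packages;
lxml.etree provides the same XMLParser(target=...) and feed()/close() calls.

```python
import html
import io
import xml.etree.ElementTree as etree  # lxml.etree offers the same target interface

# Assumption: the "r" prefix maps to the RDF syntax namespace; adjust as needed.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def local_name(tag):
    """Strip a leading '{namespace}' from a tag name."""
    return tag.rsplit("}", 1)[-1]


class LinkCollector:
    """Parser target that writes one <p> per link resource found inside a Topic."""

    def __init__(self, out, resource_attr="{%s}resource" % RDF_NS):
        self.out = out
        self.resource_attr = resource_attr
        self.topic_depth = 0  # greater than 0 while inside a Topic element
        self.count = 0
        out.write("<html><body><h2>ODP URLs</h2>\n")

    def start(self, tag, attrib):
        name = local_name(tag)
        if name == "Topic":
            self.topic_depth += 1
        elif name == "link" and self.topic_depth:
            url = attrib.get(self.resource_attr)
            if url is not None:
                self.out.write("<p>%s</p>\n" % html.escape(url))
                self.count += 1

    def end(self, tag):
        if local_name(tag) == "Topic":
            self.topic_depth -= 1

    def data(self, text):
        pass  # text content is not needed for this task

    def close(self):
        self.out.write("</body></html>\n")
        return self.count  # becomes the return value of parser.close()


def extract_links(source, out, chunk_size=64 * 1024):
    """Feed a binary file object to the parser in chunks; no tree is built."""
    parser = etree.XMLParser(target=LinkCollector(out))
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    return parser.close()
```

You would call it with the input opened in binary mode and the output opened
as text, e.g. extract_links(open("content.rdf", "rb"), open("urls.html", "w")),
where both file names are placeholders. Since the input is read in fixed-size
chunks and each URL is written out as soon as it is seen, memory use should
stay roughly constant regardless of the document size.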

Stefan

