[lxml-dev] Python script to optimize XML text

Matthew Cruickshank lxml at holloway.co.nz
Mon Sep 24 23:45:58 CEST 2007


Robert Dailey wrote:
> Note that the following were changed:
> - All comments were stripped from the XML
> - All spaces, tabs, carriage returns, and other forms of unimportant 
> whitespace are removed
> - Elements that contain no text or children that are in the form of 
> <Empty></Empty> use the short-hand method for ending an element body: 
> <Empty/>

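As Sidnei says, an XSLT is probably the easiest way.

The first and third requirements are mostly handled for you in XSLT (I 
think): the built-in rule for comments outputs nothing, and the 
serializer writes an element with no content as <Empty/>. So you'd only 
need to match text nodes and normalize them...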

<xsl:template match="text()">
    <xsl:value-of select="normalize-space(.)"/>
</xsl:template>
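
For anyone wanting to try this from lxml, here is a rough sketch of a
complete stylesheet built around that template. It is my guess at a full
version, not something from the thread: on its own the template above
falls back on the built-in rules, which emit only text, so the sketch adds
an identity template to copy elements and attributes through. Since the
identity template would also copy comments, it drops them explicitly. The
names minify_xml, STYLESHEET and SOURCE are made up for illustration.

```python
from lxml import etree

# Illustrative stylesheet (an assumption, not the one from the thread):
# an identity transform plus two overrides.
STYLESHEET = b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="no"/>

  <!-- Identity template: copy elements and attributes through. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Drop comments; the explicit priority avoids a tie with node(). -->
  <xsl:template match="comment()" priority="1"/>

  <!-- Collapse whitespace; whitespace-only text nodes vanish. -->
  <xsl:template match="text()" priority="1">
    <xsl:value-of select="normalize-space(.)"/>
  </xsl:template>
</xsl:stylesheet>
"""

# Compile the stylesheet once and reuse the transformer.
_transform = etree.XSLT(etree.XML(STYLESHEET))


def minify_xml(xml_bytes):
    """Return the XML with comments and unimportant whitespace removed."""
    result = _transform(etree.fromstring(xml_bytes))
    # Serializing the root element gives no XML declaration, and elements
    # without content come out in the short <Empty/> form.
    return etree.tostring(result.getroot())


# Hypothetical sample input covering the three requirements.
SOURCE = b"""<root>
  <!-- a comment -->
  <Empty></Empty>
  <item>  some   text  </item>
</root>"""
```

Calling minify_xml(SOURCE) should give
<root><Empty/><item>some text</item></root>. One thing to watch:
normalize-space() also trims the edges of every text node, so in mixed
content such as <p>Hello <b>world</b></p> the space before <b> is lost.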

PS. Please avoid using regexes with XML... that way leads to madness. 
With the possibility of commented-out nodes, nested structures and 
such, regexes will only ever work on a tiny subset of XML.
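
A small made-up example of what goes wrong. The sample document and the
names SAMPLE, regex_hits and parsed_hits are invented for illustration,
and the sketch uses the standard library's ElementTree just to keep it
self-contained.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input with a commented-out node and a nested structure.
SAMPLE = """<list>
  <!-- <item>disabled</item> -->
  <item>one</item>
  <item>outer <item>inner</item> tail</item>
</list>"""

# Naive regex: treats the document as flat text, so it picks up the
# commented-out item and cuts the nested item off at the first </item>.
regex_hits = re.findall(r"<item>(.*?)</item>", SAMPLE)

# Parser: the default ElementTree parser discards comments and keeps
# the nesting intact, so only real elements are visited.
root = ET.fromstring(SAMPLE)
parsed_hits = ["".join(el.itertext()) for el in root.iter("item")]

print(regex_hits)
print(parsed_hits)
```

The regex reports the "disabled" item that is inside the comment and
returns "outer <item>inner" for the nested one, while the parser skips the
comment and sees both the outer and the inner element.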


.Matthew Cruickshank
http://docvert.org << MS Word to OpenDocument via an extensible XML Pipeline
