[lxml-dev] Python script to optimize XML text
Matthew Cruickshank
lxml at holloway.co.nz
Mon Sep 24 23:45:58 CEST 2007
Robert Dailey wrote:
> Note that the following were changed:
> - All comments were stripped from the XML
> - All spaces, tabs, carriage returns, and other forms of unimportant
> whitespace are removed
> - Elements that contain no text or children that are in the form of
> <Empty></Empty> use the short-hand method for ending an element body:
> <Empty/>
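As Sidnei says, XSLT is probably the easiest way.
The first and third requirements come more or less for free (I think): XSLT's
built-in rules drop comments, and the serializer normally writes an element
with no content as <Empty/>. So on top of the usual identity template that
copies elements and attributes, you'd only need to match text nodes and
normalize them: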
<xsl:template match="text()">
<xsl:value-of select="normalize-space(.)"/>
</xsl:template>
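
For anyone who wants to try this from Python, here is a minimal sketch using
lxml's XSLT support. The surrounding stylesheet, the identity template and the
function name optimize_xml are my assumptions, not something from the original
request; only the text() template is the one shown above. Note that
normalize-space() also collapses whitespace inside mixed content, and that
processing instructions are dropped along with comments, so check that this
suits your documents.

```python
from lxml import etree

# Assumed stylesheet: the text() template from this post plus an identity
# template. The identity template matches only elements and attributes, so
# comments (and processing instructions) fall through to XSLT's built-in
# rule for them, which outputs nothing.
STYLESHEET = b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="*|@*">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="text()">
    <xsl:value-of select="normalize-space(.)"/>
  </xsl:template>
</xsl:stylesheet>
"""


def optimize_xml(source: bytes) -> bytes:
    """Hypothetical helper: apply the stylesheet and serialize the result."""
    transform = etree.XSLT(etree.XML(STYLESHEET))
    result = transform(etree.XML(source))
    # Serialize the root element directly; no XML declaration is added.
    return etree.tostring(result.getroot())


if __name__ == "__main__":
    sample = (
        b"<root>\n"
        b"  <!-- a comment -->\n"
        b"  <Empty></Empty>\n"
        b"  <item>  some   text  </item>\n"
        b"</root>"
    )
    print(optimize_xml(sample).decode())
```

Compiling the stylesheet once and reusing the transform object would be the
obvious change if you process many files.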
P.S. Please avoid using regexes on XML; that way lies madness.
With commented-out nodes, nested structures and the like,
regexes will only ever work on a tiny subset of XML.
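
To make the regex problem concrete, here is a small illustration using only
the Python standard library. The sample document and the pattern are made up
for this example; the point is just that a naive pattern picks up a
commented-out node and mis-pairs nested tags, while a real parser does not.

```python
import re
import xml.etree.ElementTree as ET

# Made-up sample: one commented-out <item> and one <item> nested in another.
DOC = "<list><!-- <item>old</item> --><item>a<item>b</item>c</item></list>"

# Naive pattern: knows nothing about comments or nesting.
regex_items = re.findall(r"<item>(.*?)</item>", DOC)

# ElementTree's default parser discards comments and tracks nesting properly.
parsed_items = [item.text for item in ET.fromstring(DOC).iter("item")]

print(regex_items)   # ['old', 'a<item>b']
print(parsed_items)  # ['a', 'b']
```

The regex reports the commented-out "old" as if it were live and pairs the
outer opening tag with the inner closing tag; the parser sees exactly the two
real elements.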
.Matthew Cruickshank
http://docvert.org << MS Word to OpenDocument via an extensible XML Pipeline