[lxml-dev] Python script to optimize XML text

Robert Dailey rcdailey at gmail.com
Tue Sep 25 00:02:43 CEST 2007


If I wanted to remove all whitespace between elements, I would use this
regex:

exp = re.compile(r">[\t\n ]+<", re.IGNORECASE | re.DOTALL)

However, this isn't working for some reason. I'm fairly new to regular
expressions, so I may be missing something obvious. Thanks.
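
For reference, here is a minimal sketch of how the pattern above could be
applied. The message does not say how the compiled pattern is used, so the
sample string and the sub() call are assumptions made for illustration. The
two flags are dropped because they have no effect on this pattern (it
contains no letters and no "."). One possible cause of "not working" is
calling match(), which only tests the start of the string, but that is a
guess.

```python
import re

# Same character class as in the message; the flags are omitted because
# they change nothing for this pattern.
exp = re.compile(r">[\t\n ]+<")

# Hypothetical sample input, not taken from the original thread.
sample = "<a>\n\t<b> text </b>\n  <c></c>\n</a>"

# sub() replaces every match anywhere in the string. match() would only
# look at the very start of the string and would find nothing here.
compact = exp.sub("><", sample)
print(compact)
```

Note that the character class does not include "\r", so files with
Windows line endings would keep their carriage returns. Using "\s" in
place of "[\t\n ]" would cover them.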

On 9/24/07, Matthew Cruickshank <lxml at holloway.co.nz> wrote:
>
>  Robert Dailey wrote:
>
> Note that the following were changed:
> - All comments were stripped from the XML
> - All spaces, tabs, carriage returns, and other forms of unimportant
> whitespace were removed
> - Elements that contain no text or children and are written in the form
> <Empty></Empty> use the short-hand method for ending an element body:
> <Empty/>
>
>
> As Sidnei says, an XSLT is probably the easiest way.
>
> The first and third requirements are done by default in XSLT (I think), so
> you'd only need to match text nodes and normalize them...
>
> <xsl:template match="text()">
>     <xsl:value-of select="normalize-space(.)"/>
> </xsl:template>
>
> ps. please avoid using regexes with XML... that way leads to madness. With
> the possibility of commented-out nodes, nested structures and such, regexes
> will only ever work on a tiny subset of XML.
>
>
> .Matthew Cruickshank
> http://docvert.org << MS Word to OpenDocument via an extensible XML
> Pipeline
>
>
> _______________________________________________
> lxml-dev mailing list
> lxml-dev at codespeak.net
> http://codespeak.net/mailman/listinfo/lxml-dev
>
>