[lxml-dev] Python script to optimize XML text
Robert Dailey
rcdailey at gmail.com
Tue Sep 25 00:05:31 CEST 2007
Whoops! Never mind! I also needed \r in the character class :) Works perfectly now.
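For anyone following along, here is a minimal sketch of the corrected pattern. It assumes the regex quoted below with \r added to the character class; the names between_tags, strip_between_tags and sample are made up for illustration. The IGNORECASE and DOTALL flags are left out because the pattern has no letters and no "." for them to affect.

```python
import re

# Assumption: the pattern from the quoted message, plus \r so that
# Windows-style line endings (\r\n) between tags are matched as well.
between_tags = re.compile(r">[\t\r\n ]+<")


def strip_between_tags(xml_text):
    """Remove whitespace-only runs that sit between a '>' and the next '<'."""
    return between_tags.sub("><", xml_text)


# Hypothetical sample input with \r\n line endings and tab/space indentation.
sample = "<root>\r\n\t<a>keep  this</a>\r\n  <b/>\r\n</root>"
print(strip_between_tags(sample))
```

Whitespace inside element text (the two spaces in "keep  this") is left alone. As Matthew points out below, a pattern like this will also rewrite matching runs inside comments or CDATA sections, so it is only safe on input where that cannot happen.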
On 9/24/07, Robert Dailey <rcdailey at gmail.com> wrote:
>
> If I wanted to remove all whitespace between elements, I would use this
> regex:
>
> exp = re.compile( ">[\t\n ]+<",re.IGNORECASE | re.DOTALL )
>
> However, this isn't working for some reason. I'm fairly new to regular
> expressions so I may be missing something obvious. Thanks.
>
> On 9/24/07, Matthew Cruickshank <lxml at holloway.co.nz> wrote:
>
> > Robert Dailey wrote:
> >
> > Note that the following were changed:
> > - All comments were stripped from the XML
> > - All spaces, tabs, carriage returns, and other forms of unimportant
> > whitespace are removed
> > - Elements that contain no text or children that are in the form of
> > <Empty></Empty> use the short-hand method for ending an element body:
> > <Empty/>
> >
> >
> > As Sidnei says, an XSLT is probably the easiest way.
> >
> > The first and third requirement are done by default in XSLT (I think),
> > so you'd only need to match text nodes and normalize them...
> >
> > <xsl:template match="text()">
> > <xsl:value-of select="normalize-space(.)"/>
> > </xsl:template>
> >
> > ps. please avoid using regexes with XML... that way leads to madness.
> > With the possibility of commented-out nodes, nested structures and such,
> > regexes will only ever work on a tiny subset of XML.
> >
> >
> > .Matthew Cruickshank
> > http://docvert.org << MS Word to OpenDocument via an extensible XML
> > Pipeline
> >
> >
>
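As a parser-based alternative to the regex, in the spirit of Matthew's advice quoted above, the three requirements can be sketched with the standard library's xml.etree.ElementTree. This is a stand-in for illustration, not the XSLT approach and not lxml itself; compact_xml and sample are hypothetical names.

```python
import xml.etree.ElementTree as ET


def compact_xml(xml_text):
    """Return xml_text without comments or whitespace-only text nodes."""
    # The default ElementTree parser discards comments while parsing.
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Drop text and tail values that consist of whitespace only.
        if elem.text is not None and not elem.text.strip():
            elem.text = None
        if elem.tail is not None and not elem.tail.strip():
            elem.tail = None
    # Elements with no text and no children are written in short form.
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample covering a comment, indentation and an empty element.
sample = """<root>
  <!-- a comment -->
  <a>keep  this</a>
  <Empty></Empty>
</root>"""
print(compact_xml(sample))
```

Unlike normalize-space(), this sketch leaves whitespace inside real text untouched and only removes text nodes that are entirely whitespace. ElementTree writes the short form with a space before the slash (<Empty />).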