[lxml-dev] Python script to optimize XML text
Mike Meyer
mwm-keyword-lxml.9112b8 at mired.org
Tue Sep 25 00:36:33 CEST 2007
On Mon, 24 Sep 2007 17:02:43 -0500 "Robert Dailey" <rcdailey at gmail.com> wrote:
> If I wanted to remove all whitespace between elements, I would use this
> regex:
>
> exp = re.compile( ">[\t\n ]+<",re.IGNORECASE | re.DOTALL )
>
> However, this isn't working for some reason. I'm fairly new to regular
> expressions so I may be missing something obvious. Thanks.
Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems.
-- Jamie Zawinski
I think you can do this relatively simply with standard ElementTree
tools:
>>> from lxml.etree import parse
>>> doc = parse('.paneconfig.xml')
>>> # visit every element and trim its text and its trailing text
>>> for e in doc.xpath('//*'):
...     if e.tail: e.tail = e.tail.strip()
...     if e.text: e.text = e.text.strip()
...
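For anyone who wants to try this without a file on disk, here is a minimal
sketch of the same loop. It assumes the standard library's
xml.etree.ElementTree rather than lxml (so iter() stands in for
xpath('//*')), and the function name strip_whitespace and the sample
document are made up for illustration:

```python
import xml.etree.ElementTree as ET

def strip_whitespace(root):
    """Strip leading/trailing whitespace from every text and tail, in place.

    Hypothetical helper; not part of ElementTree or lxml.
    """
    for e in root.iter():  # the root element plus all of its descendants
        if e.text:
            e.text = e.text.strip()
        if e.tail:
            e.tail = e.tail.strip()
    return root

# Made-up sample input standing in for a real config file.
SAMPLE = "<config>\n  <pane id='1'>\n    <title> Main </title>\n  </pane>\n</config>"

root = strip_whitespace(ET.fromstring(SAMPLE))
result = ET.tostring(root, encoding="unicode")
print(result)
```

Note that this also trims the padding around "Main", not just the
whitespace that sits between tags.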
However:
- The regular expression version will be faster if you don't
  otherwise have to deal with the text as XML.
- "unimportant whitespace" is *very* much an application-dependent
  definition. The solution I just gave you and the one presented
  above are different. The very statement implies that you're using
  XML as a data language, not a markup language, and my version works
  fine for the applications I've done that do that. Doesn't mean it's
  right for you, though.
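To make the difference concrete, here is a rough sketch of the regular
expression approach applied with re.sub. It uses the pattern from the
quoted message without the IGNORECASE and DOTALL flags, which have no
effect on a pattern containing neither letters nor a dot. The sample
text is made up for illustration:

```python
import re

# Pattern from the quoted message: whitespace-only runs between two tags.
between_tags = re.compile(r">[\t\n ]+<")

# Made-up sample with padded text and mixed content.
SAMPLE = (
    "<config>\n"
    "  <title> Main </title>\n"
    "  <note>a\n <b>bold</b> text</note>\n"
    "</config>"
)

# Replace each match with "><" so the two tags end up adjacent.
collapsed = between_tags.sub("><", SAMPLE)
print(collapsed)
```

The whitespace-only runs between tags are gone, but " Main " keeps its
padding and the mixed content inside <note> is left alone, whereas the
strip() loop would have trimmed both. The regular expression also knows
nothing about comments or CDATA sections, so it treats their contents
like any other text.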
<mike
--
Mike Meyer <mwm at mired.org> http://www.mired.org/consulting.html
Independent Network/Unix/Perforce consultant, email for more information.