[lxml-dev] Python script to optimize XML text

Mike Meyer mwm-keyword-lxml.9112b8 at mired.org
Tue Sep 25 00:36:33 CEST 2007


On Mon, 24 Sep 2007 17:02:43 -0500 "Robert Dailey" <rcdailey at gmail.com> wrote:

> If I wanted to remove all whitespace between elements, I would use this
> regex:
> 
> exp = re.compile( ">[\t\n ]+<",re.IGNORECASE | re.DOTALL )
> 
> However, this isn't working for some reason. I'm fairly new to regular
> expressions so I may be missing something obvious. Thanks.

Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems.
    -- Jamie Zawinski

I think you can do this relatively simply with standard ElementTree
tools:

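>>> from lxml.etree import parse
>>> doc = parse('.paneconfig.xml')
>>> for e in doc.xpath('//*'):
...     # strip leading/trailing whitespace from text inside and after each element
...     if e.tail: e.tail = e.tail.strip()
...     if e.text: e.text = e.text.strip()
...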

However:

- The regular expression version will be faster if you don't
  otherwise have to deal with the text as XML.
- "Unimportant whitespace" is *very* much an application-dependent
  definition. The solution I just gave you and the regex quoted above
  are different: the regex only drops whitespace-only runs between
  tags, while my loop also trims leading and trailing whitespace
  around real text. The very phrase implies that you're using XML as
  a data language, not a markup language, and my version works fine
  for the applications I've done that do that. That doesn't mean it's
  right for you, though.
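
To make the difference between the two approaches concrete, here is a
minimal sketch. It assumes the standard library's xml.etree.ElementTree
rather than lxml so that it runs without extra dependencies, and the
SAMPLE document and both function names are made up for illustration.
The regex is applied with re.sub, which is only one guess at how the
original pattern was meant to be used.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: indentation between elements plus padded text content.
SAMPLE = "<root>\n  <item> padded text </item>\n  <item>b</item>\n</root>"

def strip_with_regex(xml_text):
    # Replace whitespace-only runs between '>' and '<' with nothing.
    # Whitespace next to real text is left alone.
    return re.sub(r">[\t\n ]+<", "><", xml_text)

def strip_with_etree(xml_text):
    # Parse, then strip leading/trailing whitespace from every
    # element's text and tail, as in the loop above.
    root = ET.fromstring(xml_text)
    for e in root.iter():
        if e.text:
            e.text = e.text.strip()
        if e.tail:
            e.tail = e.tail.strip()
    return ET.tostring(root, encoding="unicode")

regex_result = strip_with_regex(SAMPLE)
etree_result = strip_with_etree(SAMPLE)
print(regex_result)
print(etree_result)
```

The regex output keeps "<item> padded text </item>" with its padding,
while the ElementTree output trims it to "<item>padded text</item>".
Which of those is correct depends on whether that padding matters to
your application.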

       <mike
-- 
Mike Meyer <mwm at mired.org>		http://www.mired.org/consulting.html
Independent Network/Unix/Perforce consultant, email for more information.

