[lxml-dev] Python script to optimize XML text

Stefan Behnel stefan_ml at behnel.de
Tue Sep 25 09:22:44 CEST 2007


Robert Dailey wrote:
> I'm currently seeking a python script that provides a way of optimizing
> out useless characters in an XML document to provide the optimal size
> for the file.

Have you tried the "remove_blank_text" and "remove_comments" keyword options
of the XMLParser? Try

   >>> from lxml import etree
   >>> help(etree.XMLParser)
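
A minimal sketch of how the two options might be used together. The sample
document is made up for illustration; only the keyword options named above
and the standard fromstring()/tostring() calls are used.

```python
from lxml import etree

# Made-up sample document, purely for illustration.
xml = b"""<root>
  <!-- a comment -->
  <item>text</item>
  <item>  padded  </item>
</root>"""

# Drop whitespace-only text nodes between tags and discard comments.
parser = etree.XMLParser(remove_blank_text=True, remove_comments=True)
root = etree.fromstring(xml, parser)

# Serialize again and compare the size with the input.
compact = etree.tostring(root)
print(len(xml), "->", len(compact))
print(compact)
```

Note that whitespace inside a text node that also carries other characters
(the second item) is left alone; only blank-only nodes are candidates.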

They may not always produce an "optimal" result, but that's because there is
no such thing as an "optimal" result (as Mike already noted). What counts as
"useless characters" in XML is very much application dependent. Just think of
an XHTML document where the whitespace was stripped from all text content:

   ... <p>some <b>bold</b> text</p> ...

or even

   ... some<span style="..."> <cite>cited</cite> </span>text ...

Not a good idea to remove all whitespace-only content here, IMHO.
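
To see the effect on the second snippet, here is a small sketch. The
surrounding <p> element and the visible_text() helper are assumptions added
only to make it parse and run on its own. Without a DTD the parser falls back
on a heuristic, so the stripped result is what I would expect rather than a
guarantee: the words should run together.

```python
from lxml import etree

# The second snippet from above, wrapped in a <p> so it parses on its own.
xhtml = b'<p>some<span style="..."> <cite>cited</cite> </span>text</p>'

def visible_text(parser):
    """Hypothetical helper: parse the snippet and join all its text nodes."""
    root = etree.fromstring(xhtml, parser)
    return "".join(root.itertext())

# Default parser: every text node is kept, including the blank ones.
kept = visible_text(etree.XMLParser())

# Same input, but whitespace-only text nodes may now be discarded.
stripped = visible_text(etree.XMLParser(remove_blank_text=True))

print(repr(kept))
print(repr(stripped))
```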

A good way to help the parser understand what you consider "useless" is to
provide a DTD.
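
A sketch of what that could look like with an internal DTD subset. The element
names are invented for the example. The idea, as far as I understand the
libxml2 behaviour, is that <doc> is declared with element-only content, so
blank text inside it is ignorable, while <p> is declared as mixed content, so
the space between its children should be kept.

```python
from lxml import etree

# Invented document type: <doc> holds only <p> elements (element-only
# content), while <p> mixes character data with <b> and <i>.
xml = b"""<!DOCTYPE doc [
  <!ELEMENT doc (p*)>
  <!ELEMENT p (#PCDATA | b | i)*>
  <!ELEMENT b (#PCDATA)>
  <!ELEMENT i (#PCDATA)>
]>
<doc>
  <p><b>bold</b> <i>italic</i></p>
</doc>"""

parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(xml, parser)

# Indentation inside <doc> should be gone, the space inside <p> should not.
print(etree.tostring(root))
print(repr("".join(root[0].itertext())))
```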

Stefan

