[lxml-dev] Python script to optimize XML text
Stefan Behnel
stefan_ml at behnel.de
Tue Sep 25 09:22:44 CEST 2007
Robert Dailey wrote:
> I'm currently seeking a python script that provides a way of optimizing
> out useless characters in an XML document to provide the optimal size
> for the file.
Have you tried the "remove_blank_text" and "remove_comments" keyword options
of the XMLParser? Try
>>> help(etree.XMLParser)
They may not always produce an "optimal" result, but that's because there is
no such thing as an "optimal" result (as Mike already noted). What counts as
"useless characters" in XML is very much application dependent. Just think of
an XHTML document where all whitespace-only text content was stripped:
... <p>some <b>bold</b> text</p> ...
or even
... some<span style="..."> <cite>cited</cite> </span>text ...
Not a good idea to remove all whitespace-only content here, IMHO.
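
A minimal sketch of both points, assuming lxml is installed. The sample
documents and variable names are made up for illustration. The first part
shrinks a data-style document with the two parser options, the second shows
what the same option does to inline markup like the example above:

```python
from lxml import etree

# Data-style document: indentation and comments carry no meaning here.
source = b"""<root>
  <!-- build note -->
  <a>text</a>
  <b/>
</root>"""

parser = etree.XMLParser(remove_blank_text=True, remove_comments=True)
compact = etree.tostring(etree.fromstring(source, parser))
print(len(source), "->", len(compact), "bytes")

# Mixed content: the whitespace-only nodes inside <span> are significant.
inline = b'<p>some<span style="x"> <cite>cited</cite> </span>text</p>'

kept = "".join(etree.fromstring(inline).itertext())
blank_parser = etree.XMLParser(remove_blank_text=True)
stripped = "".join(etree.fromstring(inline, blank_parser).itertext())

print(repr(kept))      # text as seen with the default parser
print(repr(stripped))  # text after whitespace-only nodes were dropped
```

Comparing the two printed strings shows whether the words still have spaces
between them once the blank text nodes are gone.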
A good way to help the parser understand what you consider "useless" is to
provide a DTD.
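
As a rough sketch of the DTD approach, assuming the DTD is given as an
internal subset and that the parser consults its element declarations when
"remove_blank_text" is set (the element names "doc", "item" and "em" are
hypothetical):

```python
from lxml import etree

# <doc> is declared with element-only content, <item> with mixed content.
source = b"""<!DOCTYPE doc [
<!ELEMENT doc (item+)>
<!ELEMENT item (#PCDATA | em)*>
<!ELEMENT em (#PCDATA)>
]>
<doc>
  <item><em>a</em> <em>b</em></item>
  <item>plain</item>
</doc>"""

parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(source, parser)

first_em = root[0][0]
print(repr(root.text))      # whitespace directly inside <doc>
print(repr(first_em.tail))  # whitespace between the two <em> elements
```

The idea is that the content model tells the parser where whitespace can only
be indentation (inside "doc") and where it may be part of the text (inside
"item"), so it does not have to guess.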
Stefan