[lxml-dev] c14n, pretty printing and diffing

Olivier Collioud Olivier.Collioud at wipo.int
Tue Feb 12 10:52:59 CET 2008


For those interested in the iterparse method, the following is much
better:

            depth = 0
            sourceTree = ElementTree.iterparse(
                open(inputDir + '/' + file, 'r'), events=("start", "end"))
            for event, elem in sourceTree:

                if event == "start":
                    i = "\n" + depth*"  "
                    depth += 1
                    outputFile.write('%s<%s' % (i, elem.tag))
                    # Sort the attributes for a stable order; skip 'size'.
                    attrs = sorted(a for a in elem.items() if a[0] != 'size')
                    if attrs:
                        outputFile.write(' ')
                        outputFile.write(' '.join(['%s="%s"' % (a[0], a[1])
                                                   for a in attrs]))
                    outputFile.write('>')
                    if elem.text and elem.text.strip():
                        outputFile.write(elem.text.strip('\n').encode('utf-8'))

                if event == "end":
                    depth -= 1
                    # Recompute the indentation: i still holds the value set
                    # by the most recent "start" event (the deepest child).
                    i = "\n" + depth*"  "
                    outputFile.write('%s</%s>' % (i, elem.tag))
                    if elem.tail and elem.tail.strip():
                        outputFile.write(elem.tail.strip('\n').encode('utf-8'))
                    elem.clear()

because when event == 'start', len(elem) is always 0, and I don't know
how to tell in advance whether the element will have content, in order
to produce an empty tag (or not).
Therefore, the above code always produces an element end tag, even when
there is no content.
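
One possible way around this, as a minimal sketch for Python 3 and the
standard library's xml.etree.ElementTree (not lxml): leave each start tag
open until the next event arrives. If that event is the "end" of the same
element, no child was seen and an empty tag can be written; otherwise the
tag is closed with '>'. Text and tail are also read one event late, since
they are not guaranteed to be set at "start" and "end" respectively. The
names pretty_print and skip_attrs are made up for this illustration, and
the escaping via xml.sax.saxutils is an addition that the code above does
not do. Namespaces, comments and PIs are ignored, as before.

```python
import io
import xml.etree.ElementTree as ElementTree
from xml.sax.saxutils import escape, quoteattr


def pretty_print(source, out, skip_attrs=("size",)):
    """Write an indented copy of `source` with sorted attributes to `out`."""
    depth = 0
    pending = None  # element whose start tag is written but still open
    closed = None   # element whose end tag is written; tail not yet flushed
    for event, elem in ElementTree.iterparse(source, events=("start", "end")):
        if closed is not None:
            # The tail of the previously closed element is reliable now.
            if closed.tail and closed.tail.strip():
                out.write(escape(closed.tail.strip()))
            closed.clear()
            closed = None
        if event == "start":
            if pending is not None:
                # A child is starting, so the parent cannot be an empty tag.
                out.write(">")
                if pending.text and pending.text.strip():
                    out.write(escape(pending.text.strip()))
            attrs = "".join(
                " %s=%s" % (name, quoteattr(value))
                for name, value in sorted(elem.items())
                if name not in skip_attrs
            )
            out.write("%s<%s%s" % ("\n" + depth * "  ", elem.tag, attrs))
            depth += 1
            pending = elem
        else:  # "end"
            depth -= 1
            if pending is elem:
                # No child started since this element was opened.
                if elem.text and elem.text.strip():
                    out.write(">%s</%s>" % (escape(elem.text.strip()), elem.tag))
                else:
                    out.write("/>")
            else:
                out.write("%s</%s>" % ("\n" + depth * "  ", elem.tag))
            pending = None
            closed = elem


# Usage: parse from a binary file object, write to a text file object.
buf = io.StringIO()
pretty_print(io.BytesIO(b'<a z="1" b="2" size="9"><b/><c>hi</c></a>'), buf)
print(buf.getvalue())
```

Each element is cleared only once its tail has been written, so memory use
should stay low on large files, as with the elem.clear() call above.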

>>> "Olivier Collioud" <Olivier.Collioud at wipo.int> 12/02/08 7:26 am >>>
Thanks Stefan.

I prefer visual diffing tools: the ones provided by Eclipse, TkDiff or
WinMerge.

I did not find any documentation or usage example for lxml.usedoctest;
could you please give me some pointers?

Let me share my solution based on iterparse (simple, because I do not
use any namespaces, PIs, comments...):

    depth = 0
    sourceTree = ElementTree.iterparse(open(inputFile, 'r'),
                                       events=("start", "end"))
    for event, elem in sourceTree:

        if event == "start":
            i = "\n" + depth*"  "
            depth += 1
            outputFile.write('%s<%s' % (i, elem.tag))
            # Sort the attributes for a stable order; skip 'size'.
            attrs = sorted(a for a in elem.items() if a[0] != 'size')
            if attrs:
                outputFile.write(' ')
                outputFile.write(' '.join(['%s="%s"' % (a[0], a[1])
                                           for a in attrs]))
            if elem.text and elem.text.strip():
                outputFile.write('>%s' % elem.text.strip('\n').encode('utf-8'))
            elif len(elem):
                # Children may not be attached yet at "start", so this
                # branch can be skipped for elements that do have children.
                outputFile.write('>')

        if event == "end":
            depth -= 1
            # Recompute the indentation: i still holds the value set by the
            # most recent "start" event, i.e. that of the deepest child.
            i = "\n" + depth*"  "
            if (elem.text and elem.text.strip()) or len(elem):
                outputFile.write('%s</%s>' % (i, elem.tag))
            else:
                outputFile.write('/>')
            if elem.tail and elem.tail.strip():
                outputFile.write(elem.tail.strip('\n').encode('utf-8'))
            elem.clear()

Olivier.

>>> Stefan Behnel <stefan_ml at behnel.de> 11/02/08 7:56 pm >>>
Hi,

Olivier Collioud wrote:
> I would like to use my favourite text diffing tool to compare XML
> files.

Which is not lxml.html.diff, I assume? (I'm not sure how HTML specific
that is, BTW). Also, for doctests, there is lxml.usedoctest that you can
import (the lxml web pages use it for doctests).


> Is there a way to produce a pretty printed canonical version of my XML
> files using lxml?

Not using the c14n interface (libxml2 doesn't support it). Serialising
by hand is not too hard, though. You can look at ElementTree._write()
for an example:

http://svn.effbot.org/public/elementtree/elementtree/ElementTree.py 

Stefan
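
As an aside for readers on Python 3.9 or later: the standard library's
xml.etree.ElementTree (not lxml) has since gained indent() and
canonicalize(), which can be combined into a diff-friendly serialisation.
The sketch below is only an illustration under that assumption, and the
function name canonical_pretty is made up. Note that re-indenting changes
whitespace-only text nodes, so the result is meant for diffing and is not
the strict canonical form of the original document.

```python
import xml.etree.ElementTree as ET


def canonical_pretty(xml_text):
    """Return an indented C14N 2.0 serialisation of xml_text."""
    root = ET.fromstring(xml_text)
    # Replace whitespace-only text and tails with regular indentation.
    ET.indent(root, space="  ")
    # canonicalize() sorts attributes and writes start/end tag pairs.
    return ET.canonicalize(ET.tostring(root, encoding="unicode"))


# Two inputs that differ only in attribute order and whitespace.
first = canonical_pretty('<a z="1" b="2"><b/></a>')
second = canonical_pretty('<a b="2"   z="1">\n<b></b>\n\n</a>')
print(first == second)
```

Both calls should yield the same text, so a line-based diff tool would
report no difference between the two inputs.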

_______________________________________________
lxml-dev mailing list
lxml-dev at codespeak.net 
http://codespeak.net/mailman/listinfo/lxml-dev 


------
World Intellectual Property Organization Disclaimer:

This electronic message may contain privileged, confidential and
copyright protected information. If you have received this e-mail
by mistake, please immediately notify the sender and delete this
e-mail and all its attachments. Please ensure all e-mail attachments
are scanned for viruses prior to opening or using.

