[lxml-dev] Difference between xhtml etrees

D dalist0 at gmail.com
Tue Jun 16 23:08:41 CEST 2009


Hi All,


> please CC the list on replies.
I am sorry, I pressed the wrong button.

I made a running example and attached three small files. The code
finds the difference between the two files r1.xhtml and r2.xhtml and
writes the output to the file rdiff.xhtml. This file does not
display correctly in Firefox.

Please note that the output diff is not entirely correct. r1 reads
"Leave some solvent in the bowl."
and r2 reads
"Leave some solvent in the bowl and heat."
The code marks:
<html:ins>bowl and heat.END{http://www.w3.org/1999/xhtml}p&gt;
{http://www.w3.org/1999/xhtml}p&gt; Previous Versions:
{http://www.w3.org/1999/xhtml}b&gt;{http://www.w3.org/1999/xhtml}p&gt;</html:ins>
as inserted, i.e. "bowl and heat." instead of just "and heat".
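
The word-level part of this result is what a whitespace-token diff would be expected to produce, because "bowl." (with the period attached) and "bowl" are different tokens. A minimal sketch using the standard library's difflib instead of lxml.html.diff, on the assumption that htmldiff also compares whitespace-separated words:

```python
import difflib

# Split both sentences on whitespace, so punctuation stays attached to words.
old = "Leave some solvent in the bowl.".split()
new = "Leave some solvent in the bowl and heat.".split()

matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)

# Collect only the spans that differ between the two token lists.
changes = [
    (tag, old[i1:i2], new[j1:j2])
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag != "equal"
]
print(changes)
```

Here the only change is a replacement of the token "bowl." by the three tokens "bowl", "and", "heat.", which matches the "bowl and heat." span reported above.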

> Note that "load_dtd" does not imply validation, just that a DTD will be
> loaded if referenced.
Unfortunately my original XHTML is very non-conforming. (I am planning
to migrate a laboratory notebook that was unfortunately written in
Word. The plan is to copy from Word, paste into Kompozer, then parse
the result, get rid of all the Word-specific stuff and validate later.
This is necessary because each experiment is composed of many smaller
descriptions which will be put together into one big file.
Unfortunately Word 2007 still cannot handle a master document that
contains other documents.)
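
To illustrate the quoted note about "load_dtd", here is a minimal sketch. The inline document and its internal DTD subset are hypothetical stand-ins for the real files, chosen so that no external DTD or network access is needed. With load_dtd alone a non-conforming document still parses; adding dtd_validation makes the parser reject it.

```python
from lxml import etree

# Hypothetical document: the DTD only allows <child> inside <root>,
# but the body uses <other>, so it is well-formed yet not valid.
DOC = b"""<!DOCTYPE root [
<!ELEMENT root (child)>
<!ELEMENT child (#PCDATA)>
]>
<root><other>text</other></root>"""


def parses_ok(parser):
    """Return True if DOC parses with the given parser, False otherwise."""
    try:
        etree.fromstring(DOC, parser)
        return True
    except etree.XMLSyntaxError:
        return False


# The DTD is loaded, but the document is not checked against it.
loaded_only = parses_ok(etree.XMLParser(load_dtd=True))
# The document is checked against the DTD while parsing.
validated = parses_ok(etree.XMLParser(load_dtd=True, dtd_validation=True))
print(loaded_only, validated)
```

For very non-conforming input, parsing without dtd_validation and validating in a separate later step fits the plan described above.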

best

Daniel




import lxml.etree
import lxml.html
import lxml.html.diff


def minimalExample():
    # The files contain entities like &nbsp;.
    # Often the input contains illegal attributes (start, type in ol),
    # element content that does not conform to the DTD (br), and
    # illegally nested paragraphs (p in p, p in b).
    parser = lxml.etree.XMLParser(load_dtd=True, dtd_validation=True,
                                  no_network=False)
    r1 = lxml.etree.parse("r1.xhtml", parser)
    r2 = lxml.etree.parse("r2.xhtml", parser)
    # Diff the two root elements; htmldiff returns the result as a string.
    diff = lxml.html.diff.htmldiff(r1.getroot(), r2.getroot())
    # Parse the diff string as HTML, convert it to XHTML and write it out.
    pdiff = lxml.html.document_fromstring(diff)
    lxml.html.html_to_xhtml(pdiff)
    pe = lxml.etree.ElementTree(pdiff)
    pe.write("rdiff.xhtml", pretty_print=True)
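
One possible explanation for the {http://www.w3.org/1999/xhtml}p text in the diff output, an assumption that is not confirmed in this thread, is that htmldiff expects plain HTML tag names while trees parsed as XML carry namespace-qualified tags. lxml.html.xhtml_to_html strips the XHTML namespace from tags in place. The sketch below uses a hypothetical inline document and only shows the effect on the tag names, not whether the diff then comes out as intended.

```python
import lxml.etree
import lxml.html

# Hypothetical stand-in for r1.xhtml.
XHTML = (
    b'<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    b"<p>Leave some solvent in the bowl.</p>"
    b"</body></html>"
)

root = lxml.etree.fromstring(XHTML)
tags_before = [el.tag for el in root.iter()]

# Remove the XHTML namespace from every tag, modifying the tree in place.
lxml.html.xhtml_to_html(root)
tags_after = [el.tag for el in root.iter()]

print(tags_before)
print(tags_after)
```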
-------------- next part --------------
A non-text attachment was scrubbed...
Name: r1.xhtml
Type: application/xhtml+xml
Size: 850 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20090616/71dda68b/attachment.xhtml 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: r2.xhtml
Type: application/xhtml+xml
Size: 845 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20090616/71dda68b/attachment-0001.xhtml 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: rdiff.xhtml
Type: application/xhtml+xml
Size: 1075 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20090616/71dda68b/attachment-0002.xhtml 

