[lxml-dev] Strange problem with lxml.html.diff

Ian Bicking ianb at colorstudy.com
Thu Mar 13 20:05:13 CET 2008


James Zhu wrote:
> Hi guys,
> 
> Here's what I did:
> 
> james at orchid ~ $ python
> Python 2.4.4 (#1, Mar 10 2008, 14:55:59)
> [GCC 4.1.2 (Gentoo 4.1.2 p1.0.2)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from lxml import etree
> >>> etree.LXML_VERSION
> (2, 0, 2, 0)
> >>> from lxml.html.diff import htmldiff
> >>> doc1 = """some <p> test"""
> >>> doc2 = """some <p> text"""
> >>> print htmldiff(doc1, doc2)
> some <p><ins>text</ins></p> <p><del>test</del></p>
> >>> doc3 = """some <br> test"""
> >>> doc4 = """some <br> text"""
> >>> print htmldiff(doc3, doc4)
> some <br>
> 
> It seems that the contents after <br> mysteriously disappeared. Any ideas?

Hmm... it looks like a bug with empty elements, generally.  For instance:

 >>> print htmldiff('<p>Some <img src="x"> text</p>',
 ...                '<p>Some <img src="x"> other</p>')
<p>Some <img src="x"></p>

At first I thought it might be something with block-level elements, but no:

 >>> print htmldiff('<p>Some <span>x</span> text</p>',
 ...                '<p>Some <span>x</span> other</p>')
<p>Some <span>x</span> <ins>other</ins> <del>text</del> </p>

It looks like there's some code in htmldiff that drops empty tags 
(<br>, <img> and the like), ignoring their .tail, i.e. the text that 
follows the tag.  It might be a small fix, but I'm not sure, and with 
PyCon I'm a little pressed for time, so I'm afraid I can't fix it 
right now.
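
For anyone unfamiliar with .tail, here is a minimal sketch of why 
ignoring it loses text.  It uses the standard library's 
xml.etree.ElementTree (lxml follows the same .text/.tail element model), 
with a well-formed <img .../> standing in for the HTML above.  The two 
helper functions are hypothetical illustrations, not htmldiff's actual 
code.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for '<p>Some <img src="x"> text</p>'.
p = ET.fromstring('<p>Some <img src="x"/> text</p>')
img = p[0]

print(repr(p.text))    # text before the <img>
print(repr(img.text))  # an empty element has no text of its own
print(repr(img.tail))  # text after the <img> is stored on the <img> itself


def visible_text(elem):
    """Hypothetical helper: collect text in document order, tails included."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(visible_text(child))
        parts.append(child.tail or "")
    return "".join(parts)


def naive_text(elem):
    """Hypothetical buggy variant: skips empty children and, with them,
    the text stored in their .tail."""
    parts = [elem.text or ""]
    for child in elem:
        if len(child) == 0 and not child.text:
            continue  # the tail is never looked at
        parts.append(naive_text(child))
        parts.append(child.tail or "")
    return "".join(parts)


print(repr(visible_text(p)))
print(repr(naive_text(p)))
```

The naive walker returns only the text before the <img>, which matches 
the kind of truncation seen in the htmldiff output above; whether 
htmldiff fails in exactly this way is an assumption.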

   Ian

