[lxml-dev] Strange problem with lxml.html.diff
Ian Bicking
ianb at colorstudy.com
Thu Mar 13 20:05:13 CET 2008
James Zhu wrote:
> Hi guys,
>
> Here's what I did:
>
> james at orchid ~ $ python
> Python 2.4.4 (#1, Mar 10 2008, 14:55:59)
> [GCC 4.1.2 (Gentoo 4.1.2 p1.0.2)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from lxml import etree
> >>> etree.LXML_VERSION
> (2, 0, 2, 0)
> >>> from lxml.html.diff import htmldiff
> >>> doc1 = """some <p> test"""
> >>> doc2 = """some <p> text"""
> >>> print htmldiff(doc1, doc2)
> some <p><ins>text</ins></p> <p><del>test</del></p>
> >>> doc3 = """some <br> test"""
> >>> doc4 = """some <br> text"""
> >>> print htmldiff(doc3, doc4)
> some <br>
>
> It seems that the contents after <br> mysteriously disappeared. Any ideas?
Hmm... it looks like a bug with empty elements, generally. For instance:
>>> print htmldiff('<p>Some <img src="x"> text</p>', '<p>Some <img src="x"> other</p>')
<p>Some <img src="x"></p>
At first I thought it might be something with block-level elements, but no:
>>> print htmldiff('<p>Some <span>x</span> text</p>', '<p>Some <span>x</span> other</p>')
<p>Some <span>x</span> <ins>other</ins> <del>text</del> </p>
It looks like there's some code in htmldiff that drops empty tags
(like <br> and <img>), ignoring their .tail, which is where the text
following the tag is stored. It might be a small fix, but I'm not sure,
and with PyCon I'm a little pressed for time, so I can't fix it right
now, I'm afraid.
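To illustrate what ignoring .tail would mean, here is a minimal sketch of the suspected failure mode. It is not htmldiff's actual code. It uses the standard library's xml.etree.ElementTree, which shares the .text/.tail model with lxml, so the sample markup is written as XML (<br/>). The helper names drop_empty_naive and drop_empty_keep_tail are hypothetical and exist only for this example.

```python
import xml.etree.ElementTree as ET

SAMPLE = "<p>some <br/> test</p>"

root = ET.fromstring(SAMPLE)
br = root.find("br")
# Text before <br/> belongs to the parent; text after it is stored on <br/> itself.
print(repr(root.text))  # 'some '
print(repr(br.text))    # None
print(repr(br.tail))    # ' test'


def drop_empty_naive(parent):
    """Remove empty children; each removed child's tail is lost with it."""
    for child in list(parent):
        if child.text is None and len(child) == 0:
            parent.remove(child)


def drop_empty_keep_tail(parent):
    """Remove empty children, but move each tail onto the preceding text."""
    previous = None  # last child that was kept
    for child in list(parent):
        if child.text is None and len(child) == 0:
            tail = child.tail or ""
            if previous is None:
                # No kept sibling before it: the tail joins the parent's text.
                parent.text = (parent.text or "") + tail
            else:
                previous.tail = (previous.tail or "") + tail
            parent.remove(child)
        else:
            previous = child


naive = ET.fromstring(SAMPLE)
drop_empty_naive(naive)
print(ET.tostring(naive, encoding="unicode"))  # <p>some </p>

kept = ET.fromstring(SAMPLE)
drop_empty_keep_tail(kept)
print(ET.tostring(kept, encoding="unicode"))   # <p>some  test</p>
```

The naive version loses " test" in the same way the <br> and <img> examples above lose everything after the empty tag. A real fix in htmldiff would presumably keep the tag as well as its tail; this only shows where the missing text lives.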
Ian