[lxml-dev] Difference between xhtml etrees

Stefan Behnel stefan_ml at behnel.de
Tue Jun 16 16:00:21 CEST 2009


Hi,

please CC the list on replies.

D wrote:
> 2009/6/11 Stefan Behnel:
>> D wrote:
>>>  I have two xhtml documents which I would like to compare. They are
>>>  available as etrees.
>>>  Ideally I would like to have a resulting tree, where the appropriate
>>>  changes are marked with ins and del tags. I don't need anything fancy
>>> like a detection of moves.
>>>
>>>  I had a look at lxml.html.diff
>>>  http://codespeak.net/lxml/lxmlhtml.html#html-diff
>>>  but it operates on html strings only, and not on my parsed tree.
>>
>> Did you try passing the root elements of the trees?
>
> Passing the root objects was a good idea, it generates the
> difference the way I want it. I just can't manage to get the data back
> to xhtml. Maybe you could have a look:
>
> Here is my code:
> def expandFiles(filename):
>     """open the file named filename, return an etree"""
>     document = "".join(open(filename).readlines())
>     px = lxml.etree.XMLParser(load_dtd=True, no_network=False)
>     px.feed(document)
>     rx = px.close()
>     docx = lxml.etree.ElementTree(rx)
>     return docx

Note that "load_dtd" does not imply validation, just that a DTD will be
loaded if referenced.
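
To illustrate the difference, here is a minimal sketch. The sample document
and its inline DTD are made up for this example; "dtd_validation=True" is the
parser option that actually switches validation on.

```python
from io import BytesIO
from lxml import etree

# Hypothetical document: the inline DTD only allows <b> inside <a>,
# but the document contains <c/>, so it is well-formed yet invalid.
DOC = b"""<!DOCTYPE a [
<!ELEMENT a (b)>
<!ELEMENT b EMPTY>
]>
<a><c/></a>"""

# load_dtd=True only loads the DTD; the invalid <c/> is still accepted.
loading_parser = etree.XMLParser(load_dtd=True)
root = etree.parse(BytesIO(DOC), loading_parser).getroot()
loaded_without_error = root[0].tag == "c"

# dtd_validation=True validates while parsing and rejects the document.
validating_parser = etree.XMLParser(dtd_validation=True)
try:
    etree.parse(BytesIO(DOC), validating_parser)
    rejected = False
except etree.XMLSyntaxError:
    rejected = True
```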

Also, it is a *lot* more efficient to do this:

  import lxml.etree

  # create the parser once and reuse it for every file
  parser = lxml.etree.XMLParser(load_dtd=True, no_network=False)

  def expandFiles(filename):
      """open the file named filename, return an etree"""
      return lxml.etree.parse(filename, parser)

... and I'd actually rename the function (or drop it completely).

> r1=expandFiles(r"1.xhtml")
> r2=expandFiles(r"2.xhtml")
> diff = lxml.html.diff.htmldiff(r1.getroot(),r2.getroot())
> # diff is now an html fragment, parse it
> pdiff = lxml.html.document_fromstring(diff)
> lxml.html.html_to_xhtml(pdiff)
> pe = lxml.etree.ElementTree(pdiff)

So far, so good.
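
For reference, a rough sketch of what these steps do, using two made-up
input fragments instead of your files (htmldiff takes strings as well as
parsed elements). It only shows the state of the tree after
html_to_xhtml(), not a fix for the serialisation problem below.

```python
import lxml.html
from lxml.html.diff import htmldiff

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Made-up sample fragments standing in for the two documents.
old = "<p>Resist preparation</p>"
new = "<p>Resist SPR3012 preparation</p>"

# htmldiff returns an HTML fragment as a string, with <ins>/<del> markers.
diff = htmldiff(old, new)

# Parse the fragment into a full HTML document tree.
pdiff = lxml.html.document_fromstring(diff)

# html_to_xhtml works in place: it moves every tag into the XHTML namespace,
# so tag names become "{http://www.w3.org/1999/xhtml}name".
lxml.html.html_to_xhtml(pdiff)

# The inserted text is now found under the namespaced tag name.
ins_elements = pdiff.findall(".//{%s}ins" % XHTML_NS)
```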


> # this gives me an xhtml file that is parsed without errors by
> # firefox, but does not contain any markup
> # it looks like this in firefox: {http://www.w3.org/1999/xhtml}meta>
> # Resist SPR3012 Preparation{http://www.w3.org

Not sure how this can happen. I'll give it a try later today.


> # in addition, all character entities appear in the form &gt; and not
> # like they should: &#62;

Would you have a 'real' example here?


> I don't manage to transform pdiff to the same form r1 and r2 are in.
>
> I am sure this is due to a basic misunderstanding of lxml, maybe you
> directly see what I am doing wrong?

Not directly, no. Maybe others have an idea?

Stefan


