[lxml-dev] Difference between xhtml etrees
Stefan Behnel
stefan_ml at behnel.de
Tue Jun 16 16:00:21 CEST 2009
Hi,
please CC the list on replies.
D wrote:
> 2009/6/11 Stefan Behnel:
>> D wrote:
>>> I have two xhtml documents which I would like to compare. They are
>>> available as etrees.
>>> Ideally I would like to have a resulting tree, where the appropriate
>>> changes are marked with ins and del tags. I don't need anything fancy
>>> like a detection of moves.
>>>
>>> I had a look at lxml.html.diff
>>> http://codespeak.net/lxml/lxmlhtml.html#html-diff
>>> but it operates on html strings only, and not on my parsed tree.
>>
>> Did you try passing the root elements of the trees?
>
> passing the root objects was a good idea, it can generate the
> difference the way I want it. I just don't manage to get the data back
> to xhtml. Maybe you could have a look:
>
> Here is my code:
> def expandFiles(filename):
>     """open the file named filename, return an etree"""
>     document = "".join(open(filename).readlines())
>     px = lxml.etree.XMLParser(load_dtd=True, no_network=False)
>     px.feed(document)
>     rx=px.close()
>     docx=lxml.etree.ElementTree(rx)
>     return docx
Note that "load_dtd" does not imply validation, just that a DTD will be
loaded if referenced.
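To illustrate the distinction, here is a minimal sketch. It assumes a small made-up document whose internal DTD subset is violated by the document itself; "dtd_validation" is the separate parser option that switches validation on.

```python
from lxml import etree

# Hypothetical document: the DTD requires <child/>, but <other/> is used.
DOC = b"""<!DOCTYPE root [
<!ELEMENT root (child)>
<!ELEMENT child EMPTY>
]>
<root><other/></root>"""

# load_dtd=True only loads the DTD; the invalid content is still accepted.
loading_parser = etree.XMLParser(load_dtd=True)
root = etree.fromstring(DOC, loading_parser)

# dtd_validation=True asks the parser to validate while parsing.
validating_parser = etree.XMLParser(dtd_validation=True)
try:
    etree.fromstring(DOC, validating_parser)
    rejected = False
except etree.XMLSyntaxError:
    rejected = True
```

Only the second parser should object to the document; the first one returns a normal tree.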
Also, it is a *lot* more efficient to do this:
    parser = lxml.etree.XMLParser(load_dtd=True, no_network=False)
    def expandFiles(filename):
        """open the file named filename, return an etree"""
        return lxml.etree.parse(filename, parser)
... and I'd actually rename the function (or drop it completely).
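As a rough illustration of reusing one parser, the following sketch writes two tiny made-up XHTML files to a temporary directory and parses both with a single shared parser. The file contents are assumptions for the example, and "no_network" is left at its default because these files reference no DTD.

```python
import os
import tempfile

from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

# One parser instance, created once and reused for every file.
parser = etree.XMLParser(load_dtd=True)

with tempfile.TemporaryDirectory() as tmpdir:
    trees = []
    for name, text in [("1.xhtml", "old"), ("2.xhtml", "new")]:
        path = os.path.join(tmpdir, name)
        with open(path, "w", encoding="utf-8") as f:
            f.write('<html xmlns="%s"><body><p>%s</p></body></html>'
                    % (XHTML_NS, text))
        # etree.parse() reads the file itself and returns an ElementTree.
        trees.append(etree.parse(path, parser))

root_tags = [tree.getroot().tag for tree in trees]
```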
> r1=expandFiles(r"1.xhtml")
> r2=expandFiles(r"2.xhtml")
> diff = lxml.html.diff.htmldiff(r1.getroot(),r2.getroot())
> # diff is now an html fragment, parse it
> pdiff = lxml.html.document_fromstring(diff)
> lxml.html.html_to_xhtml(pdiff)
> pe = lxml.etree.ElementTree(pdiff)
So far, so good.
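For readers following along, the steps in the quoted code can be sketched end to end on a tiny input. This is only an illustration under assumptions, not a diagnosis of the problem reported below: the two HTML snippets are made up, fragment_fromstring() is used instead of document_fromstring() because the diff result is a fragment, and the result is serialized with lxml.etree.tostring().

```python
import lxml.etree
import lxml.html
from lxml.html.diff import htmldiff

# Made-up inputs; htmldiff() also accepts HTML strings.
old = "<p>Hello old world</p>"
new = "<p>Hello new world</p>"

# The result is an HTML fragment (a string) with <ins>/<del> markup.
diff = htmldiff(old, new)

# Parse the fragment, wrapping it in a <div> so there is a single root.
pdiff = lxml.html.fragment_fromstring(diff, create_parent="div")

# Move all elements into the XHTML namespace (modifies the tree in place).
lxml.html.html_to_xhtml(pdiff)

# Serialize as XML; the elements now carry the XHTML namespace.
xhtml = lxml.etree.tostring(pdiff, encoding="unicode")
```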
> # this gives me an xhtml file that is parsed without errors by
> # firefox, but does not contain any markup
> # it looks like this in firefox: {http://www.w3.org/1999/xhtml}meta>
> # Resist SPR3012 Preparation{http://www.w3.org
Not sure how this can happen. I'll give it a try later today.
> # in addition, all character entities appear in the form > and not
> like they should: #62;
Would you have a 'real' example here?
> I don't manage to transform pdiff to the same form r1 and r2 are in.
>
> I am sure this is due to a basic misunderstanding of lxml, maybe you
> directly see what I am doing wrong?
Not directly, no. Maybe others have an idea?
Stefan