[lxml-dev] DOM tree intersection/comparison?

Viksit Gaur vik.list.nutch at gmail.com
Sat May 24 12:36:36 CEST 2008


Hey Mike,

Mike Meyer wrote:
> 
> I've written some code to diff two xml trees. The real issue is that
> "the differences between two trees" isn't really well defined. I.e. -
> does order of children matter? Not for attribute nodes, and maybe not
> for other nodes, depending on the application. What about whitespace?
> Same answer - some of it yes, some of it depends on the
> application. Look at a modern diff's different options for whitespace
> handling, then fold in XML's newline handling to see how nasty that
> can get.

Thanks for pointing out some interesting questions - I had thought of a
couple, but I was counting on the others not being too relevant to what
I was doing. Whitespace diffs are actually really bad - and I guess
Unicode is not going to sit well in the mix if I ever have to move
to multilingual support.
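
To make those choices concrete, here is a rough sketch of a comparison
where whitespace handling and child order are explicit switches. The
names trees_equal, _norm, ignore_whitespace and ordered are made up for
illustration and are not part of lxml; "ignoring whitespace" is assumed
to mean stripping leading and trailing whitespace from text and tail.

```python
from lxml import etree


def _norm(value, ignore_whitespace):
    """Treat None as "", and optionally strip surrounding whitespace."""
    value = value or ""
    return value.strip() if ignore_whitespace else value


def trees_equal(a, b, ignore_whitespace=True, ordered=True):
    """Hypothetical helper: True if elements a and b match under the options."""
    # Attribute order never matters, so compare attributes as dictionaries.
    if a.tag != b.tag or dict(a.items()) != dict(b.items()):
        return False
    if _norm(a.text, ignore_whitespace) != _norm(b.text, ignore_whitespace):
        return False
    if len(a) != len(b):
        return False

    def same(x, y):
        # A child's tail text belongs to the parent's content, so check it here.
        return (_norm(x.tail, ignore_whitespace) == _norm(y.tail, ignore_whitespace)
                and trees_equal(x, y, ignore_whitespace, ordered))

    if ordered:
        return all(same(x, y) for x, y in zip(a, b))

    # Unordered: greedily pair each child of a with an unused equal child of b.
    unmatched = list(b)
    for x in a:
        for i, y in enumerate(unmatched):
            if same(x, y):
                del unmatched[i]
                break
        else:
            return False
    return True


compact = etree.fromstring('<r a="1" b="2"><x>1</x><y/></r>')
pretty = etree.fromstring('<r b="2" a="1">\n  <x>1</x>\n  <y/>\n</r>')
swapped = etree.fromstring('<r a="1" b="2"><y/><x>1</x></r>')

print(trees_equal(compact, pretty))                           # True
print(trees_equal(compact, pretty, ignore_whitespace=False))  # False
print(trees_equal(compact, swapped, ordered=False))           # True
```

The unordered branch is quadratic in the number of children, which is
probably fine for small documents but would need a smarter matching
step for large ones.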

> 
> FWIW, I'm not sure you get a "subtree" - more like forest. Or maybe it
> depends on exactly what you mean by "differences". I.e. - if an
> attribute changed value and that was the only difference, I wanted
> that attribute pulled out.  I could see where you might define things
> so that the difference was the largest common subtree, or some such.

The latter was what I was aiming for. I'm not so much trying to compute
an intersection between two trees as to construct a compressed
representation of them.
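
One possible reading of "compressed representation" is to store each
distinct subtree only once and let both documents refer to it. The
sketch below does that by giving every structurally identical subtree
the same integer id. It is only an illustration under assumptions:
intern_subtrees and table are hypothetical names, child order is taken
to matter, tail text is ignored, and text is compared after stripping
surrounding whitespace.

```python
from lxml import etree


def intern_subtrees(root, table):
    """Return an id for root, reusing the id of any identical subtree.

    table maps a structural key to an id and is shared across trees, so a
    subtree that occurs in both documents is recorded only once.
    """
    # Children are interned first, so ids are assigned bottom-up.
    child_ids = tuple(intern_subtrees(child, table) for child in root)
    key = (
        root.tag,
        tuple(sorted(root.items())),   # attribute order is irrelevant
        (root.text or "").strip(),     # assumption: surrounding whitespace ignored
        child_ids,
    )
    # len(table) is evaluated before insertion, so new keys get the next free id.
    return table.setdefault(key, len(table))


table = {}
doc1 = etree.fromstring('<r><p class="a">hi</p><q><p class="a">hi</p></q></r>')
doc2 = etree.fromstring('<s><q><p class="a">hi</p></q></s>')

intern_subtrees(doc1, table)
intern_subtrees(doc2, table)

# The two documents contain 7 elements in total but only 4 distinct subtrees.
print(len(table))  # 4
```

Subtrees that end up with the same id in both documents are the shared
part; anything whose id appears in only one document is a candidate
"difference".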

Cheers,
Viksit

