I have two pieces of text. I would like to make a word-based diff between them (like the Unix utility wdiff does) but with more information in the output (namely, the character position where each added or deleted word starts).

I need to do this in Java, so a plain textual output of the differences (like wdiff's) doesn't suit me: I would like to manipulate objects representing the differences.

+1  A: 

There's Diff, Match, Patch - available in Java, and a demo is available - it seems to do word differences.

mdma
I tried it a lot and it is basically char-based. If you want human-readable output you have to set a very high timeout, the computation is really slow, and it still is not word-based (I mean "house" and "wife" are found to differ only in "hous" and "wif").
Mycol
Did you see the section on post-processing cleanup? You may be able to add a post-processor that aligns differences to words. Is it for English text? When you raise the level to words, the problem becomes more complex. Even just tokenizing the text accurately into words takes some effort, and then you have the problem of disambiguating differences - changes can be interpreted in several ways, and which one makes sense may depend upon your application. Moving a block of text by cut and paste is in principle one operation, but detecting it can be difficult.
mdma
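
As a rough illustration of the tokenizing step mentioned in the comment above, here is a minimal sketch that splits text into words while remembering the character offset where each word starts. It assumes a "word" is simply a run of non-whitespace characters, which is a simplification (punctuation stays attached to the word). The names WordTokenizer and Token are made up for this example and do not come from any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any diff library.
final class WordTokenizer {

    /** A word plus the character offset where it starts in the source text. */
    static final class Token {
        final String text;
        final int start;

        Token(String text, int start) {
            this.text = text;
            this.start = start;
        }

        @Override
        public String toString() {
            return "\"" + text + "\" @" + start;
        }
    }

    // Assumption: a word is any run of non-whitespace characters.
    private static final Pattern WORD = Pattern.compile("\\S+");

    /** Returns the words of the text in order, each with its start offset. */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            tokens.add(new Token(m.group(), m.start()));
        }
        return tokens;
    }
}
```

Keeping the offset next to each word is what later lets a word-level difference be reported together with its character position in the original text.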
If you can map words to characters (e.g. by ensuring there are no more than 64k unique words), then you can parse the text yourself, map each word to a character, and run character differencing on that. Of course, if the Diff implementation lets you easily replace the data types being compared, you may be able to implement word differencing trivially by passing word objects as input rather than chars. I haven't seen the Diff API, so I can't say for sure.
mdma
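
The word-to-character idea from the last comment could look roughly like the sketch below. Each distinct word gets one char from a shared dictionary, the two encoded strings are compared character by character, and every difference is mapped back to a word and its start offset. Two things here are assumptions: words are split on whitespace only, and the character diff is a plain longest-common-subsequence table written for this example as a stand-in, not the Diff, Match, Patch library itself. WordDiffSketch, Change and Op are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the LCS walk below stands in for a real character diff.
final class WordDiffSketch {
    enum Op { EQUAL, INSERT, DELETE }

    /** One word-level difference; position is an offset in the old text
     *  for EQUAL and DELETE, and in the new text for INSERT. */
    static final class Change {
        final Op op;
        final String word;
        final int position;

        Change(Op op, String word, int position) {
            this.op = op;
            this.word = word;
            this.position = position;
        }

        @Override
        public String toString() {
            return op + " \"" + word + "\" @" + position;
        }
    }

    private static final Pattern WORD = Pattern.compile("\\S+");

    /** Encodes each word as one char and records the words and their start offsets. */
    static String encode(String text, Map<String, Character> dict,
                         List<String> words, List<Integer> starts) {
        StringBuilder sb = new StringBuilder();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            Character c = dict.get(m.group());
            if (c == null) {
                if (dict.size() > Character.MAX_VALUE) {
                    throw new IllegalStateException("more than 64k unique words");
                }
                c = (char) dict.size();
                dict.put(m.group(), c);
            }
            sb.append(c.charValue());
            words.add(m.group());
            starts.add(m.start());
        }
        return sb.toString();
    }

    static List<Change> diff(String oldText, String newText) {
        Map<String, Character> dict = new HashMap<>();
        List<String> oldWords = new ArrayList<>(), newWords = new ArrayList<>();
        List<Integer> oldStarts = new ArrayList<>(), newStarts = new ArrayList<>();
        String a = encode(oldText, dict, oldWords, oldStarts);
        String b = encode(newText, dict, newWords, newStarts);

        // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
        int[][] lcs = new int[a.length() + 1][b.length() + 1];
        for (int i = a.length() - 1; i >= 0; i--) {
            for (int j = b.length() - 1; j >= 0; j--) {
                lcs[i][j] = a.charAt(i) == b.charAt(j)
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // Walk the table and translate each character step back into a word.
        List<Change> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            if (a.charAt(i) == b.charAt(j)) {
                out.add(new Change(Op.EQUAL, oldWords.get(i), oldStarts.get(i)));
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                out.add(new Change(Op.DELETE, oldWords.get(i), oldStarts.get(i)));
                i++;
            } else {
                out.add(new Change(Op.INSERT, newWords.get(j), newStarts.get(j)));
                j++;
            }
        }
        for (; i < a.length(); i++) {
            out.add(new Change(Op.DELETE, oldWords.get(i), oldStarts.get(i)));
        }
        for (; j < b.length(); j++) {
            out.add(new Change(Op.INSERT, newWords.get(j), newStarts.get(j)));
        }
        return out;
    }
}
```

Calling WordDiffSketch.diff("my house is big", "my wife is very big") reports "house" as deleted at offset 3 of the old text, and "wife" and "very" as inserted at offsets 3 and 11 of the new text, so whole words are compared instead of "hous" versus "wif". The table needs memory proportional to the product of the two word counts, so for long texts a real diff implementation applied to the encoded strings would likely be the better choice.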