views:

1545

answers:

6
+10  Q: 

Diff Algorithm

I've been looking like crazy for an explanation of a diff algorithm that works and is efficient.

The closest I got is this link to RFC 3284 (from several Eric Sink blog posts), which describes in perfectly understandable terms the data format in which the diff results are stored. However, it has no mention whatsoever as to how a program would reach these results while doing a diff.

I'm trying to research this out of personal curiosity, because I'm sure there must be tradeoffs when implementing a diff algorithm, which are pretty clear sometimes when you look at diffs and wonder "why did the diff program chose this as a change instead of that?"...

Does anyone know where I can find a description of an efficient algorithm that'd end up outputting VCDIFF?
By the way, if you happen to find a description of the actual algorithm used by SourceGear's DiffMerge, that'd be even better.

NOTE: longest common subsequence doesn't seem to be the algorithm used by VCDIFF, it looks like they're doing something smarter, given the data format they use.

Thanks!

+8  A: 

What's wrong with looking at the actual source for diff? I seem to recall that GNU makes the source code available.

From the docs in that package:

The basic algorithm is described in "An O(ND) Difference Algorithm and its Variations", Eugene W. Myers, 'Algorithmica' Vol. 1 No. 2, 1986, pp. 251-266; and in "A File Comparison Program", Webb Miller and Eugene W. Myers, 'Software--Practice and Experience' Vol. 15 No. 11, 1985, pp. 1025-1040. The algorithm was independently discovered as described in "Algorithms for Approximate String Matching", E. Ukkonen, `Information and Control' Vol. 64, 1985, pp. 100-118.

paxdiablo
Hmmm, in short, sometimes figuring out the underlying algorithm from actual source code (especially if it's optimized to be efficient) can be quite complex. I will be able to understand what the program is doing step by step, but not exactly "why", or a high level overview about that...Example: You'd never understand how regular expressions work (or what they are) by looking at the implementation of Perl's Regexes. Or if you could do that, then I tip my hat, I definitely need a more explained, higher level overview to figure out what's going on.
Daniel Magliola
I never understand how the vast majority of Perl works :-), but there's a link in the package docs (see update) that will point you to the literature describing it (it being the diff algorithm, not Perl).
paxdiablo
Don't read the code. Read the paper.
Ira Baxter
+2  A: 

Nothing on wikipedia ?

You can maybe try to find another implementation in a hight level langage like python, that might be easier to understand than a C implementation. Python is famous for being easily readable ?

There's a difflib in python. Here's the url to the source. The source has tons of comments about diff algorithms. http://svn.python.org/view/python/trunk/Lib/difflib.py?revision=69846&view=markup

bsergean
A: 

I have my own diff-algorithm. It's not particularly fancy, it's in the works of being rewritten so it isn't feature-complete in terms of functions and methods, but it seems to work.

It does not rely on LCS to work, but outputs commands that looks like VCDIFF's. At the moment it isn't VCDIFF-dataformat compatible, but that's something I'm going to look into in the future.

The source for the algorithm is available online as a WebSVN View, and as a Subversion Repository (you'll need a Subversion Client for this one).

Let me know if it's something you would like to use, and I'll prioritize completing the unit tests and functionality. Also note that I'd be happy to answer any questions you might have about it or the way it works, if you just want to implement your own.

Lasse V. Karlsen
+7  A: 

An O(ND) Difference Algorithm and its Variations is a fantastic paper and you may want to start there. It includes pseudo-code and a nice visualization of the graph traversals involved in doing the diff.

Section 4 of the paper introduces some refinements to the algorithm that make it very effective.

Successfully implementing this will leave you with a very useful tool in your toolbox (and probably some excellent experience as well).

Generating the output format you need can sometimes be tricky, but if you have understanding of the algorithm internals, then you should be able to output anything you need. You can also introduce heuristics to affect the output and make certain tradeoffs.

Here is a page that includes a bit of documentation, full source code, and examples of a diff algorithm using the techniques in the aforementioned algorithm.

The source code appears to follow the basic algorithm closely and is easy to read.

There's also a bit on preparing the input, which you may find useful. There's a huge difference in output when you are diffing by character or token (word).

Good luck!

jscharf
+3  A: 

See http://code.google.com/p/google-diff-match-patch/

"The Diff Match and Patch libraries offer robust algorithms to perform the operations required for synchronizing plain text. ... Currently available in Java, JavaScript, C++, C# and Python"

Also see wikipedia.org Diff page and bramcohen.livejournal.com/37690.html - "bramcohen: The diff problem has been solved"

(last two links not linkified because of SOF anti-noob protection)

Emmelaich
Just wanted to mention that Cohen's algorithm also seem's to be known as Patience Diff. It's the (default?) diff algorithm in bazaar and an optional one in git.
Dale Hagglund
A: 

Based on the link Emmelaich gave, there is also a great run down of Diff Strategies on Neil Fraser's website (one of the authors of the library).

He covers basic strategies and towards the end of the article progresses to Myer's algorithm and some graph theory.

Chris S