views:

272

answers:

4

How do applications like DiffMerge detect differences in text files, and how do they determine when a line is new, and not just on a different line than the file being checked against?

Is this something that is fairly easy to implement? Are there already libraries to do this?

+5  A: 

Here's the paper that served as the basis for the UNIX command-line tool diff.

Kristo
+3  A: 

It actually is pretty simple; DIFF programes - most of the time - are based on the Longest Common Sequence, which can be solved using a graph algorithm.

This web page gives example implementations in C#.

beef2k
+4  A: 

That's a complex question. Performing a diff means finding the minimum edit distance between the two files. That is, the minimum number of changes you must make to transform one file into the other. This is equivalent to finding the longest common subsequence of lines between the two files, and this is the basis for the various diff programs. The longest common subsequence problem is well known, and you should be able to find the dynamic programming solution on google.

The trouble with the dynamic programming approach is that it's O(n^2). It's thus very slow on large files and unusable for large, binary strings. The hard part in writing a diff program is optimizing the algorithm for your problem domain, so that you get reasonable performance (and reasonable results). The paper "An Algorithm for Differential File Comparison" by Hunt and McIlroy gives a good description of an early version of the Unix diff utility.

Peter Ruderman
The files I will be differencing are very small, 10 - 50 lines, so the speed of the algorithm isn't a problem.
scottm
And Kristo already mentioned a paper that reduces it to O(ND).
beef2k
+3  A: 

There are libraries. Here's one: http://code.google.com/p/google-diff-match-patch/

StackOverflow uses Beyond Compare for its diff. I believe it works by calling Beyond Compare from the command line.

Nosredna