tags:

views:

240

answers:

4

Where can I find an explanation and implementation of the diff algorithm?

First of all I have to recognize that I'm not sure if this is the correct name of the algorithm. For example, how does Stack Overflow mark the differences between two edits of the same question?

PS: I know the C and PHP programming languages.

A: 

The algorithm is described in the Wikipedia entry for diff.

Marcelo Cantos
+4  A: 

What is wrong with wikipedia where it states it is Hunt-McIlroy algorithm?

There is OCR'd paper describing the algorithm (explanation) and you can inspect source (implementation).

The so related questions also list (among others):
http://stackoverflow.com/questions/3144/best-diff-algorithm
http://stackoverflow.com/questions/1510225/how-do-document-diff-algorithms-work
http://stackoverflow.com/questions/805626/diff-algorithm
which all seem to be useful.

Unreason
+1  A: 

Looking at the sourcecode of a diff program might also be helpful.

ThiefMaster
+7  A: 

There is really no such thing as "the diff algorithm". There are many different diff algorithms, and in fact the particular diff algorithms used are in some cases considered a business advantage of the particular diff tool.

In general, many diff algorithms are based on the Longest Common Subsequence (LCS) problem.

The original Unix diff program from the 1970s was written by Doug McIllroy and uses what is known as the Hunt-McIllroy algorithm. Almost 40 years later, extensions and derivatives of that algorithm are still very common.

A couple of years ago, Bram Cohen (creator of the most successful filesharing program and the least successful version control system) created the Patience Diff algorithm that is designed to give more human-readable results than LCS. It was originally implemented in the Bazaar VCS and also added to Git as an option.

However, unless you are interested in research on diff algorithms, your best bet would probably be to just use some existing diff library like Davide Libenzi's LibXDiff, which is for example what Git uses. I wouldn't be too surprised if there was already a PHP extension wrapping it. An nice alternative is Google's Diff-Match-Patch library, which is used in Bespin or WhiteRoom, for example and which is available for many languages. It uses the Meyers Diff Algorithm plus some pre- and post-processing for additional speedups.

A completely different approach, if you are more interested in merging than diffing, is called Operational Transformations. The idea of OT is that instead of figuring out the differences between two documents, you try to "reverse engineer" the operations that led to those differences. This allows for much better merging, because you can then "replay" those operations. These are most useful for real-time collaborative editors such as EtherPad, Google Wave or SubEthaEdit.

Jörg W Mittag
many thx for your answer. Unfortunately I have only one vote and this time I'll love to rank it with more
dole doug
+1 very nice :)
Unreason