views:

252

answers:

7

I need to write a simple source control system and wonder what algorithm I would use for file differences?

I don't want to look into existing source code due to license concerns. I need to have it licensed under MPL so I can't look at any of the existing systems like CVS or Mercurial as they are all GPL licensed.

Just to give some background, I just need some really simple functions - binary files in a folder. no subfolders and every file behaves like it's own repository. No Metadata except for some permissions.

Overall really simple stuff, my single concern really is how to store only the differences of a file from revision to revision without wasting too much space but also without being too inefficient (Maybe store a full version every X changes, a bit like Keyframes in Videos?)

+3  A: 

Longest Common Subsequence algorithms are the primary mechanism used by diff-like tools, and can be leveraged by a source code control system.

"Reverse Deltas" are a common approach to storage, since you primarily need to move backwards in time from the most recent revision.

JasonTrue
Hmm, I like your answer better. You actually know what you're talking about, it seems. :-P
Jaxidian
+1  A: 

I was actually thinking about something similar to this the other day... (odd, huh?)

I don't have a great answer for you but I did come to the conclusion that if I were to write a file diff tool, that I would do so with an algorithm (for finding diffs) that functions somewhat like how REGEXes function with their greediness.

As for storing DIFFs... If I were you, instead of storing forward-facing DIFFs (i.e. you start with your original file and then computer 150 diffs against it when you're working with version 151), use stored DIFFs for your history but have your latest file stored as a full version. If you do it this way, then whenever you're working with the latest file (which is probably 99% of the time), you'll get the best performance.

Jaxidian
+4  A: 

How about looking the source code of Subversion ? its licensed under Apache License 2.0

Mahesh Velaga
Thanks. Have to check if Apache and MPL are compatible, but it appears so.
Michael Stum
+2  A: 

Though fossil is GPL, the delta algorithm is based on rsync and described here

Doug Currie
+4  A: 

Patience Diff is a good algorithm for finding deltas between two files that are likely to make sense to people. This often gives better results than the naive "longest common subsequence" algorithm, but results are subjective.

Having said that, many modern revision control systems store complete files at each stage, and compute the actual differences later, only when needed. For binary files (which probably aren't terribly compressible), you may find that storing reverse deltas might be ultimately more efficient.

Greg Hewgill
That's pretty cool. Still a variation of the LCS algorithm family, but it's a very nice refinement.
JasonTrue
Interesting! (pad, pad...)
j_random_hacker
+1  A: 

Gene Myers has written a good paper An O(ND) Difference Algorithm and its Variations. When it comes to comparing sequences, Myers is the man. You probably should also read Walter Tichy's paper on RCS; it explains how to store a set of files by storing the most recent version plus differences.

Norman Ramsey
+1  A: 

The idea of storing deltas (forwards or backwards) is classic with respect to version control. The issue has always been, "what delta do you store?"

Lots of source control systems store deltas as computed essentially by "diff", e.g, line-oriented complement of longest-common-subsequences. But you can compute deltas for specific types of documents in a way specific to those documents, to get smaller (and often more understandable) deltas.

For programming languages source code, one can compute Levenshtein distances over program structures. A set of tools for doing essentially this for a variety of popular programming langauges can be found at Smart Differencer

If you are storing non-text files, you might be able to take advantage of their structure to compute smaller deltas.

Of course, if what you want is a minimal implementation, then just storing the complete image of each file version is easy. Terabyte disks make that solution workable if not pretty. (The PDP10 file system used to do this implicitly).

Ira Baxter