Git design decision on storing content rather than differences

+6 Q:

Could anyone give me some idea as to why the git developers made the design decision to store the contents of files (blobs), so that when the content changes a new blob needs to be created?

I believe subversion stores revisions rather than contents, so when the content changes, it simply keeps track of the differences between the two. Couldn't git have done it like this as well? What's the benefit of storing contents rather than revisions?

+9 A:

I couldn't find the answer with a quick google, but I believe it boils down to a simple "it doesn't matter 'cause disk space is cheap".

Storing revisions within a source code management tool is tricky. If you only ever store the difference between the previous revision and the current, you end up with two problems:

Returning the latest revision (the common case) requires the most work, as the code needs to assemble that revision by combining every revision together.
Any error (say, a disk fault) to one revision corrupts access to every later revision.

I believe that most modern VCS actually store the latest revision (for performance reasons) and differences, if used, are used to go back in time, not forwards.
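To make the cost difference concrete, here is a minimal sketch, not how any real VCS stores data. It assumes a toy delta format (a list of replaced lines, with every revision having the same number of lines), and the names `make_delta`, `apply_delta`, `checkout_forward` and `checkout_reverse` are made up for illustration. It counts how many deltas must be applied to reach a revision when the full text is kept for the oldest revision versus the newest one.

```python
def make_delta(old, new):
    # Toy delta: (line index, new text) for each changed line.
    # Assumes both revisions have the same number of lines.
    return [(i, new[i]) for i in range(len(new)) if old[i] != new[i]]

def apply_delta(lines, delta):
    out = list(lines)
    for i, text in delta:
        out[i] = text
    return out

revisions = [
    ["a", "b", "c"],
    ["a", "B", "c"],
    ["A", "B", "c"],
    ["A", "B", "C"],
]
last = len(revisions) - 1

# Forward storage: full text of the oldest revision plus deltas going forward.
forward = [make_delta(revisions[i], revisions[i + 1]) for i in range(last)]

def checkout_forward(n):
    lines, steps = revisions[0], 0
    for delta in forward[:n]:
        lines, steps = apply_delta(lines, delta), steps + 1
    return lines, steps

# Reverse storage: full text of the newest revision plus deltas going back.
backward = [make_delta(revisions[i + 1], revisions[i]) for i in range(last)]

def checkout_reverse(n):
    lines, steps = revisions[last], 0
    for delta in reversed(backward[n:]):
        lines, steps = apply_delta(lines, delta), steps + 1
    return lines, steps

print(checkout_forward(last)[1])  # deltas applied to reach the newest revision
print(checkout_reverse(last)[1])  # none needed: the newest revision is stored whole
```

With forward storage the newest revision is the most expensive one to rebuild; with reverse storage it costs nothing and only old revisions need delta application.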

Bevan 2009-09-21 04:26:29

Thank you! Makes a lot of sense.

suesugi 2009-09-21 04:40:21

Git also has a 'packed' format, in which it stores most objects in deltified form. Recency order (the most recent objects as delta bases) is preferred, but not enforced.

Jakub Narębski 2009-09-21 07:47:38

+5 A:

An article that addresses this (and related) issues is Repository Formats Matter. This was one of the articles that influenced my decision to move to Git a couple of years ago. Here is an excerpt:

Given this argument, it should be clear that I think git’s repository structure is better than others, at least for X.org’s usage model. It seems to hold several interesting properties:

Files containing object data are never modified. Once written, every file is read-only from that point forward.

Compression is done off-line and can be delayed until after the primary objects are saved to backup media. This method provides better compression than any incremental approach, allowing data to be re-ordered on disk to match usage patterns.

Object data is inherently self-checking; you cannot modify an object in the repository and escape detection the first time the object is referenced.
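The self-checking property follows from the object name being a hash of the object's content. A minimal sketch in Python, assuming a repository that uses the default SHA-1 object format (the helper names `blob_id` and `verify` are made up for illustration):

```python
import hashlib

def blob_id(content: bytes) -> str:
    # Git hashes a small header ("blob <size>\0") followed by the content.
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

def verify(object_id: str, content: bytes) -> bool:
    # An object is intact only if its content still hashes to its name.
    return blob_id(content) == object_id

oid = blob_id(b"hello world\n")
print(oid)                           # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(verify(oid, b"hello world\n"))  # True
print(verify(oid, b"hello w0rld\n"))  # False: the modification is detected
```

The first value should match what `git hash-object` reports for a file containing the same bytes.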

Greg Hewgill 2009-09-21 04:56:20

+4 A:

Let me clear up your misconceptions:

Could anyone give me some idea as to why the git developers made the design decision to store the contents of files (blobs), so that when the content changes a new blob needs to be created?

A quite good explanation of the (initial) Git design can be found in Tom Preston-Werner's essay The Git Parable (in addition to the article linked in Greg Hewgill's answer).

The idea behind it is that usually (in a large enough project) a new revision changes only a few files out of the large number of files in the project, so storing only the distinct versions of each file's contents saves space. This is the same idea that Subversion uses in its 'cheap copy' technique (it uses hardlinking, IIRC).
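A toy illustration of why snapshots stay cheap when content is stored under its hash. This is only a sketch: `ToyStore` is a hypothetical class, and it hashes the raw bytes rather than using Git's real object header.

```python
import hashlib

class ToyStore:
    def __init__(self):
        self.blobs = {}      # object id -> file content
        self.snapshots = []  # one {path: object id} mapping per revision

    def commit(self, files):
        tree = {}
        for path, content in files.items():
            oid = hashlib.sha1(content).hexdigest()
            # Identical content hashes to the same id, so it is stored once.
            self.blobs.setdefault(oid, content)
            tree[path] = oid
        self.snapshots.append(tree)
        return tree

store = ToyStore()
v1 = store.commit({"A": b"alpha\n", "B": b"beta\n", "C": b"gamma\n"})
v2 = store.commit({"A": b"alpha, edited\n", "B": b"beta\n", "C": b"gamma\n"})

print(len(store.blobs))    # 4 blobs for two snapshots of three files
print(v1["B"] == v2["B"])  # True: the unchanged file shares one blob
```

Two full snapshots of three files need four blobs, not six, because only the changed file produces a new object.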

Also, the contents of each file are zlib (deflate) compressed (or, to be more exact, every object in the git repository database is compressed, including commit objects).
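As a rough sketch of what a 'loose' object looks like on disk, assuming the default SHA-1 object format: the header plus content is hashed to get the name, and the same bytes are deflated with zlib. The helper names `encode_loose` and `decode_loose` are made up for illustration, and nothing is written to disk here.

```python
import hashlib
import zlib

def encode_loose(content: bytes, kind: bytes = b"blob"):
    raw = kind + b" " + str(len(content)).encode() + b"\0" + content
    oid = hashlib.sha1(raw).hexdigest()
    # Loose objects live under a directory named after the first two hex digits.
    path = f".git/objects/{oid[:2]}/{oid[2:]}"
    return path, zlib.compress(raw)

def decode_loose(data: bytes):
    raw = zlib.decompress(data)
    header, _, content = raw.partition(b"\0")
    kind, size = header.split(b" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return kind, content

path, data = encode_loose(b"hello world\n")
print(path)
print(decode_loose(data))
```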

I believe Subversion stores revisions rather than contents, so when the content changes, it simply keeps track of the differences between the two. Couldn't git have done it like this as well? What's the benefit of storing contents rather than revisions?

I don't understand what you wanted to say here.

If it was that storing differences saves space, then I'd like to tell you that, in addition to the 'loose' format (where each blob, i.e. each distinct version of a file's contents, is stored in a separate file inside .git), Git also has a 'packed' format, where many objects are stored in deltified form, using binary deltas from the LibXDiff library.
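To give a feel for what "deltified" means, here is a toy copy/insert delta in Python. It is only a sketch of the idea: it uses the standard library's `difflib` to find matching regions rather than LibXDiff, the instruction tuples are an invented in-memory form and not Git's binary pack encoding, and `make_delta` and `apply_delta` are hypothetical names.

```python
from difflib import SequenceMatcher

def make_delta(base: bytes, target: bytes):
    """Encode target as copy-from-base and insert-literal instructions."""
    ops = []
    matcher = SequenceMatcher(a=base, b=target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))      # offset and length in base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))  # literal new bytes
        # "delete" needs no instruction: those base bytes are just not copied
    return ops

def apply_delta(base: bytes, ops) -> bytes:
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)

base = b"".join(b"line %d\n" % i for i in range(50))
target = base.replace(b"line 25\n", b"line twenty-five\n")

ops = make_delta(base, target)
literal = sum(len(op[1]) for op in ops if op[0] == "insert")

print(apply_delta(base, ops) == target)  # True: the delta rebuilds the target
print(len(target), literal)              # full size vs. literal bytes in the delta
```

Only the bytes that differ from the base have to be stored literally; everything else is a reference into an object the pack already contains.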

This format was created for network transfer (disk space might be cheap, but bandwidth isn't), and was then adopted as an on-disk format as well. It is very efficient, one of the most efficient version control storage formats if not the most efficient, which makes git repositories the smallest, or among the smallest, of the different version control systems. Depending on circumstances, a full clone of a git repository (which contains the full history) might be smaller than the equivalent Subversion checkout (which contains an extra pristine copy of each file so that svn diff and svn status work without network transfer, at reasonable speed).

This design ('loose' and 'packed' formats) has the advantage of very efficient packing, but had the disadvantage that you had to repack manually using "git gc" (not for disk space, but for performance, i.e. disk I/O); nowadays most git commands repack the repository (safely) when needed.

Jakub Narębski 2009-09-21 07:46:14

Very interesting. +1

VonC 2009-09-21 08:19:36

Thanks, I think I have a poor understanding of how subversion's 'cheap copy' technique works. To clarify, let's say a directory under svn control has files A, B, C. Now, let's say I copy this directory and put the new directory under svn as well. Now let's say I change A to A'. I believe subversion will now store the contents of A' in the repo. Is this correct? (I was thinking earlier that subversion would store A and "diff A A'". I don't think this is correct...)

suesugi 2009-09-22 10:45:45

Actually, subversion does seem to store A and diff A A' ( http://subversion.tigris.org/design.html ). So never mind about the previous comment...

suesugi 2009-09-22 10:53:24

On the **design** level Subversion is about cheap copies, Git is about snapshots and 'loose' objects; on the **engine** level Subversion stores deltas... and so does Git when it uses the 'packed' format.

Jakub Narębski 2009-09-22 12:07:57
