As far as I know, Git names each blob by its SHA1 hash so that identical files are not duplicated in the repository.

For example, if file A has a content of "abc" and has SHA1 hash as "12345", as long as the content doesn't change, the commits/branches can point to the same SHA1.

But what would happen if file A is modified to "def" and gets the SHA1 hash "23456"? Does Git store both file A and the modified file A (the whole file, not only the difference)?

  • If so, why is that? Isn't it better to store the diff info?
  • If not, how does diff track the changes in a file?
  • How about other VCSs - CVS/SVN/Perforce...?

ADDED

The following passage from the 'Git Community Book' answers most of my questions.

It is important to note that this is very different from most SCM systems that you may be familiar with. Subversion, CVS, Perforce, Mercurial and the like all use Delta Storage systems - they store the differences between one commit and the next. Git does not do this - it stores a snapshot of what all the files in your project look like in this tree structure each time you commit. This is a very important concept to understand when using Git.

+1  A: 

Git stores files by content rather than as diffs, so in your example it stores the entire modified version of A in the object database alongside the original.

  • It works out better to store whole objects because it is very easy to see if two versions of the file are the same or not just by looking at the names. Have a look at the git-book for details on how the objects are stored. This works out better because if files were tracked with diffs you would need the entire history of a file to reconstruct it. Easy to do in a centralised system, but not in a distributed system where there can be many different changes to a file.

  • Git performs the diff directly from the objects.
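To make the "just by looking at the names" point concrete, here is a minimal sketch of how a blob's name follows from its content, assuming the SHA-1 object format (a `blob <size>\0` header followed by the raw bytes). The helper name `git_blob_sha1` is made up for illustration; for the same bytes it should agree with what `git hash-object` prints.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Return the SHA-1 object name Git would give a blob with this content."""
    # Git hashes a small header plus the content, not the content alone.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


original = git_blob_sha1(b"abc")
modified = git_blob_sha1(b"def")

print(original == git_blob_sha1(b"abc"))  # True: same content, same name
print(original == modified)               # False: new content, new object
```

Because the name depends only on the bytes, an unchanged file maps to the same object in every commit and branch, while any edit produces a second, complete object.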

Abizern
The object model is one thing, but how Git actually stores those objects is an independent issue. Git can and does store objects in a diff-like way (“delta compression” in the pack files); see the later chapters in the afore linked [Git Community Book](http://book.git-scm.com/index.html): [How Git Stores Objects](http://book.git-scm.com/7_how_git_stores_objects.html), and [The Packfile](http://book.git-scm.com/7_the_packfile.html).
Chris Johnsen
True, but that is just an optimisation of the implementation isn't it? Git still thinks of individual objects, but for older or lesser used objects it stores the diffs. I take the OP's question to be more about the philosophy of Git's object model.
Abizern
+1  A: 

One of the design goals of git is speed. Consider storing objects in git as deltas rather than unique objects.

If you store each unique blob by SHA1 hash, retrieving the content from that SHA1 hash requires only a fixed computation. If you start storing deltas, you will have to reconstruct the object and the computation will no longer be fixed and could increase without bound depending on the implementation.
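The difference in retrieval cost can be sketched with two toy stores. Everything here is hypothetical and heavily simplified: `SnapshotStore` hashes the raw bytes (no Git header, no compression), and `DeltaChain` uses a naive "keep a common prefix, replace the tail" delta that is not what any real VCS does. It only illustrates that a lookup by hash is one step, while rebuilding from deltas walks the chain.

```python
import hashlib


class SnapshotStore:
    """Toy content-addressed store: every version is kept whole under its hash."""

    def __init__(self):
        self.objects = {}

    def put(self, content: bytes) -> str:
        key = hashlib.sha1(content).hexdigest()
        self.objects[key] = content  # storing identical content again changes nothing
        return key

    def get(self, key: str) -> bytes:
        return self.objects[key]  # a single lookup, however long the history is


class DeltaChain:
    """Toy delta store: version 0 is whole, later versions are (keep, tail) edits."""

    def __init__(self, base: bytes):
        self.base = base
        self.deltas = []  # each delta means: previous[:keep] + tail

    def commit(self, new: bytes) -> int:
        old = self.get(len(self.deltas))
        keep = 0
        while keep < min(len(old), len(new)) and old[keep] == new[keep]:
            keep += 1
        self.deltas.append((keep, new[keep:]))
        return len(self.deltas)

    def get(self, version: int) -> bytes:
        content = self.base
        for keep, tail in self.deltas[:version]:  # work grows with the version number
            content = content[:keep] + tail
        return content


store = SnapshotStore()
key_abc = store.put(b"abc")
key_def = store.put(b"def")

chain = DeltaChain(b"abc")
chain.commit(b"abd")
latest = chain.commit(b"abdef")
print(store.get(key_def), chain.get(latest))
```

Real delta-based systems limit this cost with periodic full copies or bounded chain lengths, which is the "depending on the implementation" caveat above.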

A good way to understand the design is to look at an actual repository (note: emails munged):

$ git cat-file commit HEAD
tree 21f9601e608cf62360fca43cd7f0bf05bb65bd23
parent 11507e17a7c823c379202ae344aa59fe5370a4fd
author John Doe <[email protected]> 1273816361 -0400
committer John Doe <[email protected]> 1273816361 -0400

Important Work

$ git ls-tree HEAD
100644 blob 2f6d9912344c299670551c9e9684a7cae800ec5d    .gitignore
...
100644 blob a3ddeb9dd0541b80981f2f78bbc500579a13459a    COPYING
040000 tree f1ac0acae2a4ab31c2a79b71f08ebd651136d706    contrib
...

You can see from these two commands that a commit is just some metadata, zero or more parents (the very first commit has none, a merge has several) and a tree. A tree in turn lists blobs and other trees.

Knowing that, you can start to consider the complexity of various repository operations. The tip of a branch is just a pointer to a commit hash. So, starting from there, listing history is just a matter of traversing the parents. Listing the contents of the tree just means traversing the tree and all its subtrees. Retrieving the file contents works as described above.
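A small sketch of the history walk, using an in-memory stand-in for the object database. The ids (`c1`, `t1`, ...), the `Commit` class and the `log` helper are all invented for illustration, and the walk follows only the first parent, which ignores the other side of merges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    tree: str        # id of the root tree for this snapshot
    parents: tuple   # ids of parent commits (empty for the first commit)
    message: str


# Hypothetical object database and branch table.
commits = {
    "c1": Commit(tree="t1", parents=(), message="Initial import"),
    "c2": Commit(tree="t2", parents=("c1",), message="Add COPYING"),
    "c3": Commit(tree="t3", parents=("c2",), message="Important Work"),
}
branches = {"master": "c3"}  # a branch tip is just a pointer to a commit


def log(commits, start):
    """Yield commit ids from start back to the root, following first parents."""
    current = start
    while current is not None:
        yield current
        parents = commits[current].parents
        current = parents[0] if parents else None


history = list(log(commits, branches["master"]))
print(history)  # ['c3', 'c2', 'c1']
```

Each step is one object lookup, so the cost of listing history is proportional to the number of commits visited, not to the size of the files they contain.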

Of course, there is always a trade-off: taken by itself, this model is quite space-inefficient, though it does provide automatic deduplication at the file level, since each unique file only needs to be stored once. The packfile mitigates the space cost effectively. Delta storage (used in svn, etc.) is more space-efficient than storing uncompressed snapshots, but once objects are packed git ultimately stores more efficiently.

To diff commits, you can see that you can start by comparing tree hashes, and then if they don't match, you traverse the tree and compare its blobs and trees, and so on. Since the model is designed around atomic commits, a file diff is more expensive, but not unreasonably so.
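The tree comparison described here could look roughly like the following sketch. The `trees` table, its ids and the `diff_trees` helper are hypothetical; each tree maps an entry name to a `(kind, id)` pair, loosely mirroring the `git ls-tree` output above. It reports only which paths differ and leaves the line-by-line file diff aside.

```python
def diff_trees(trees, old_id, new_id, prefix=""):
    """Return the paths whose objects differ between two trees."""
    if old_id == new_id:
        return []  # equal ids mean identical subtrees, so nothing to descend into
    old = trees[old_id] if old_id is not None else {}
    new = trees[new_id] if new_id is not None else {}
    changed = []
    for name in sorted(set(old) | set(new)):
        old_kind, old_child = old.get(name, (None, None))
        new_kind, new_child = new.get(name, (None, None))
        if old_child == new_child:
            continue  # same object on both sides
        path = prefix + name
        if old_kind in ("tree", None) and new_kind in ("tree", None):
            # A subtree was changed, added or removed: recurse into it.
            changed += diff_trees(trees, old_child, new_child, path + "/")
        else:
            changed.append(path)
    return changed


# Hypothetical trees: only COPYING differs between the two root trees.
trees = {
    "root_old": {".gitignore": ("blob", "b1"), "COPYING": ("blob", "b2"),
                 "contrib": ("tree", "t_contrib")},
    "root_new": {".gitignore": ("blob", "b1"), "COPYING": ("blob", "b3"),
                 "contrib": ("tree", "t_contrib")},
    "t_contrib": {"README": ("blob", "b4")},
}

changed_paths = diff_trees(trees, "root_old", "root_new")
print(changed_paths)  # ['COPYING']
```

The `contrib` subtree is never opened because its id is the same on both sides, which is why comparing two commits stays cheap even in a large repository.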

djs