ansaurus

Question

Answer 1

+1 A:

Git stores files by content rather than as diffs, so in your example it stores the entire version of A in the object database.

Storing whole objects works out better because it is very easy to see whether two versions of a file are the same just by looking at their names: each object is named by the hash of its content, so identical names mean identical content. Have a look at the git-book for details on how the objects are stored. If files were tracked as diffs, you would need the entire history of a file to reconstruct it. That is easy to do in a centralised system, but not in a distributed system, where there can be many different changes to a file.
Git computes diffs directly from the objects when they are needed.
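
To make the point about names concrete, here is a minimal Python sketch of how a blob's name follows from its content. It assumes the SHA-1 object format, where Git hashes a `blob <size>` header, a NUL byte, and then the file's bytes. The function name `git_blob_id` is only a label chosen for this illustration.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object name Git would give a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

version_a = b"hello\n"
version_b = b"hello\n"          # same bytes, so the same object name
version_c = b"hello, world\n"   # different bytes, so a different name

print(git_blob_id(version_a) == git_blob_id(version_b))  # True
print(git_blob_id(version_a) == git_blob_id(version_c))  # False
```

If you have Git installed, you can compare the result with `git hash-object` run on a file containing the same bytes.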

Abizern 2010-09-19 03:01:04

The object model is one thing, but how Git actually stores those objects is an independent issue. Git can and does store objects in a diff-like way (“delta compression” in the pack files); see the later chapters of the previously linked [Git Community Book](http://book.git-scm.com/index.html): [How Git Stores Objects](http://book.git-scm.com/7_how_git_stores_objects.html), and [The Packfile](http://book.git-scm.com/7_the_packfile.html).

Chris Johnsen 2010-09-19 05:35:15

True, but that is just an optimisation of the implementation, isn't it? Git still thinks in terms of individual objects, but for older or less-used objects it stores the diffs. I take the OP's question to be more about the philosophy of Git's object model.

Abizern 2010-09-19 05:56:54

Answer 2

+1 A:

One of the design goals of Git is speed. Consider what would happen if Git stored objects as deltas rather than as unique, whole objects.

If you store each unique blob by SHA1 hash, retrieving the content from that SHA1 hash requires only a fixed computation. If you start storing deltas, you will have to reconstruct the object and the computation will no longer be fixed and could increase without bound depending on the implementation.
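
A toy comparison may help here. This sketch is not how Git or its packfiles work internally. It assumes a made-up delta format (each version only appends text to its base) and made-up ids such as `v1`, purely to show that reading a whole object is one lookup while reading a delta-stored object costs one step per link in the chain.

```python
# Toy model only: real Git packfiles use a binary delta format, not this one.
whole_store = {}   # object id -> full content
delta_store = {}   # object id -> (base id or None, text appended to the base)

def read_whole(obj_id):
    """One dictionary lookup, regardless of how many versions exist."""
    return whole_store[obj_id]

def read_delta(obj_id):
    """Walk back to the base, then replay every delta in order.

    Returns the content and the number of deltas that had to be applied.
    """
    chain = []
    while obj_id is not None:
        base_id, appended = delta_store[obj_id]
        chain.append(appended)
        obj_id = base_id
    return "".join(reversed(chain)), len(chain)

# Build 5 versions of a file, each appending one line to the previous one.
content, previous = "", None
for n in range(1, 6):
    line = f"line {n}\n"
    content += line
    whole_store[f"v{n}"] = content
    delta_store[f"v{n}"] = (previous, line)
    previous = f"v{n}"

text, steps = read_delta("v5")
print(text == read_whole("v5"), steps)  # True 5
```

The work in `read_delta` grows with the length of the chain, which is the unbounded cost described above; `read_whole` stays the same no matter how many versions exist.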

A good way to understand the design is to look at an actual repository (note: emails munged):

$ git cat-file commit HEAD
tree 21f9601e608cf62360fca43cd7f0bf05bb65bd23
parent 11507e17a7c823c379202ae344aa59fe5370a4fd
author John Doe <[email protected]> 1273816361 -0400
committer John Doe <[email protected]> 1273816361 -0400

Important Work

$ git ls-tree HEAD
100644 blob 2f6d9912344c299670551c9e9684a7cae800ec5d    .gitignore
...
100644 blob a3ddeb9dd0541b80981f2f78bbc500579a13459a    COPYING
040000 tree f1ac0acae2a4ab31c2a79b71f08ebd651136d706    contrib
...

You can see from these two commands that a commit is just some metadata, zero or more parents (none for a root commit, more than one for a merge) and a tree. A tree in turn lists blobs and other trees.

Knowing that, you can start to consider the complexity of various repository operations. The tip of a branch is just a pointer to a commit hash. Starting from there, listing history is just a matter of traversing the parents. Listing the contents of the tree just means traversing the tree and all of its subtrees. Retrieving the file contents works as described above.
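
As a rough illustration of that traversal, here is a small Python sketch over an in-memory commit graph. The ids (`c1`, `t1`, ...), most of the messages, and the `commits` and `branches` dictionaries are hypothetical stand-ins for real hashes and refs, and the visiting order is a plain stack walk, not the ordering `git log` uses.

```python
# Hypothetical in-memory commits: id -> (tree id, list of parent ids, message).
commits = {
    "c3": ("t3", ["c2"], "Important Work"),
    "c2": ("t2", ["c1"], "Second change"),
    "c1": ("t1", [], "Initial commit"),   # a root commit has no parents
}
branches = {"master": "c3"}  # a branch tip is only a pointer to a commit id

def history(branch):
    """Return every commit id reachable from the branch tip via parent links."""
    seen, pending, order = set(), [branches[branch]], []
    while pending:
        commit_id = pending.pop()
        if commit_id in seen:        # merges can reach a commit twice
            continue
        seen.add(commit_id)
        order.append(commit_id)
        pending.extend(commits[commit_id][1])
    return order

for commit_id in history("master"):
    print(commit_id, commits[commit_id][2])
```

Each commit is visited once, so the cost is proportional to the number of reachable commits.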

Of course, there is always a trade-off. Taken on its own, this model is quite space-inefficient, although it does provide automatic deduplication at the file level, since each unique file only needs to be stored once. The inefficiency is mitigated effectively by the packfile. Delta storage (used in svn, etc.) is more space-efficient before compression is considered, but with packfiles Git ultimately stores data more efficiently.
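
The file-level deduplication can be sketched in a few lines of Python. The `objects` dictionary and `store_blob` function are hypothetical stand-ins for the object database, not Git's actual code. The sketch assumes the SHA-1 blob format and uses zlib because loose objects are zlib-compressed on disk.

```python
import hashlib
import zlib

objects = {}  # object id -> zlib-compressed bytes, standing in for .git/objects

def store_blob(content: bytes) -> str:
    """Store content under its hash; storing the same bytes again is a no-op."""
    data = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
    obj_id = hashlib.sha1(data).hexdigest()
    objects.setdefault(obj_id, zlib.compress(data))
    return obj_id

license_text = b"Permission is hereby granted...\n" * 20
a = store_blob(license_text)                       # e.g. COPYING
b = store_blob(license_text)                       # same text under another path
c = store_blob(license_text + b"one extra line\n") # a near-duplicate

print(a == b, len(objects))  # True 2
```

Note that the near-duplicate `c` is stored in full as a second object. That is the space cost the packfile's delta compression is meant to recover.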

To diff two commits, you start by comparing their tree hashes. If they don't match, you traverse the trees and compare their blobs and subtrees, and so on; any subtree whose hash matches can be skipped entirely. Since the model is designed around atomic commits, a per-file diff is more expensive, but not unreasonably so.
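
Here is a minimal Python sketch of that comparison over toy trees. The ids (`T1`, `b1`, ...) are made-up stand-ins for real hashes, and `changed_paths` is a name invented for this example. An added or removed directory is reported as a single path, which is a simplification.

```python
# Toy trees: tree id -> {entry name: (kind, object id)}.
trees = {
    "T1": {"COPYING": ("blob", "b1"), "contrib": ("tree", "T1c")},
    "T2": {"COPYING": ("blob", "b1"), "contrib": ("tree", "T2c")},
    "T1c": {"README": ("blob", "b2"), "hooks": ("tree", "T3")},
    "T2c": {"README": ("blob", "b9"), "hooks": ("tree", "T3")},
    "T3": {"pre-commit": ("blob", "b4")},
}

def changed_paths(old_id, new_id, prefix=""):
    """List paths that differ, descending only into subtrees whose ids differ."""
    if old_id == new_id:             # identical hash: nothing below can differ
        return []
    old, new = trees[old_id], trees[new_id]
    changed = []
    for name in sorted(old.keys() | new.keys()):
        before, after = old.get(name), new.get(name)
        if before == after:
            continue
        path = prefix + name
        if before and after and before[0] == after[0] == "tree":
            changed += changed_paths(before[1], after[1], path + "/")
        else:
            changed.append(path)
    return changed

print(changed_paths("T1", "T2"))  # ['contrib/README']
```

The `hooks` subtree is never opened because both sides point at the same id, which is where the model saves work on large repositories.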

djs 2010-09-19 21:17:05

Git's blob data and diff information.
