1. How does Git detect that a tree needs to be created?
When you commit, git builds a tree hierarchy for the contents of the index and then builds a commit referencing the root of that tree hierarchy. After the git-add operation, the repository contains blob objects for all of the files added, and the index contains references to the blobs paired with path names. There are no tree objects yet.
When you commit (technically, during the write-tree operation), git recursively constructs a set of trees using the index information. It starts with the trees that contain only blobs, determines their identifiers, and writes the tree objects. Then it moves up one level at a time and constructs the next set of trees. This cannot happen any earlier, because a parent tree lists the identifiers of its subtrees, so those must be known first. Finally it stores the root-level tree.
A commit operation is broken down into the write-tree and commit-tree steps. The write-tree step uses the current state of the index to identify and (if necessary) store all of the trees. The commit-tree step creates a new commit referencing all of the parent commits and the root tree that was just created.
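To make the bottom-up construction concrete, here is a minimal Ruby sketch of the idea behind write-tree. It is not git's implementation: the `index` and `store` hashes and the `write_tree` and `object_id_for` helpers are hypothetical names, every file is assumed to have mode 100644, and the real command reads the binary index file and writes compressed objects to disk.

```ruby
require 'digest/sha1'

# Hash an object the way git does: "<type> <byte length>\0<body>".
def object_id_for(type, body)
  Digest::SHA1.hexdigest("#{type} #{body.bytesize}\0".b + body.b)
end

# Build the tree for a flat list of entries ({ "dir/file" => blob_id }),
# record every tree in `store`, and return the id of the top tree.
def write_tree(entries, store)
  files = {}
  subdirs = Hash.new { |hash, key| hash[key] = {} }
  entries.each do |path, blob_id|
    head, rest = path.split('/', 2)
    if rest
      subdirs[head][rest] = blob_id   # belongs to a subdirectory
    else
      files[head] = blob_id           # plain file at this level
    end
  end

  items = files.map { |name, id| [name, '100644', id] }
  # Subtrees are hashed first, because this tree's body embeds their ids.
  subdirs.each { |name, sub| items << [name, '40000', write_tree(sub, store)] }

  # Git sorts entries by name, treating directories as if they ended in "/".
  body = items.sort_by { |name, mode, _id| mode == '40000' ? name + '/' : name }
              .map { |name, mode, id| "#{mode} #{name}\0".b + [id].pack('H*') }
              .join
  id = object_id_for('tree', body)
  store[id] = body
  id
end

EMPTY_BLOB = 'e69de29bb2d1d6434b8b29ae775ad8c2e48c5391'
index = {
  'bar/a.txt' => EMPTY_BLOB, 'bar/b.txt' => EMPTY_BLOB,
  'foo/a.txt' => EMPTY_BLOB, 'foo/b.txt' => EMPTY_BLOB
}
store = {}
root = write_tree(index, store)
puts "root tree: #{root}"
puts "distinct trees written: #{store.size}"
```

With two directories holding identical content, only two distinct trees end up in `store`: the root and one shared subtree.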
2. What is stored beneath a tree at any given moment?
When you learn how to use git, the main focus is on the directed acyclic graph (DAG) of commits: Each commit contains a pointer to its parent commit (or commits, for a merge), and you can go back in time by following these links. This makes sense, since the user interface is about commits, and trees are really secondary.
The trees also form a DAG, but the difference is that they do not represent the history of commits. Just like a blob, once a tree is created, its identifier will forever point to that tree with those contents. If any of the blobs or trees listed in a tree is modified or removed, the changed entry gets a new identifier, so the tree that lists it has different contents and is stored under a new identifier in the next commit. The old tree object is left untouched.
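A small Ruby sketch shows what "the identifier points to those contents" means in practice. The `git_object_id` helper is a hypothetical name, but the hashing scheme (SHA1 over a "type length\0" header plus the body) is the one git uses for loose objects. The two blobs are the empty file and a file containing "hello", both of which appear elsewhere in this answer.

```ruby
require 'digest/sha1'

# An object's id is the SHA1 of "<type> <byte length>\0" followed by its body.
def git_object_id(type, body)
  header = "#{type} #{body.bytesize}\0"
  Digest::SHA1.hexdigest(header.b + body.b)
end

empty = git_object_id('blob', '')
hello = git_object_id('blob', "hello\n")

puts empty   # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
puts hello   # ce013625030ba8dba906f756967f9e9ca394464a

# Hashing the same content again always gives the same id;
# changing the content gives a different one.
puts git_object_id('blob', '') == empty
puts hello != empty
```

Nothing but the content goes into the hash, which is why an identifier can never come to mean something else later.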
3. Is a new tree "revision" created any time a blob beneath that tree is modified?
Ok, let's say your repository looks like this:
foo/
  a.txt
  b.txt
bar/
  a.txt
  b.txt
and all of the files are empty. Then there are three objects in the repository, not counting the commit:
The top-level tree:
$ git cat-file -p ebf247ec5ebc97b12cd7a56db330141568edb946
040000 tree 2bdf04adb23d2b40b6085efb230856e5e2a775b7 bar
040000 tree 2bdf04adb23d2b40b6085efb230856e5e2a775b7 foo
A tree with two blobs:
$ git cat-file -p 2bdf04adb23d2b40b6085efb230856e5e2a775b7
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 a.txt
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 b.txt
The empty blob:
$ git cat-file -p e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
First I'll explain why the trees 'foo' and 'bar' are stored by the same object, then I'll make a change and see what happens.
The SHA1 identifier of a tree is determined entirely by its content, just like a blob. Note that its name is not involved, which means that renaming a tree will recreate its parent, but the tree itself will not need to be stored again. If you paste the listing of the two-blob tree above into git mktree, git will respond with the object name of the resulting tree. Under the hood, mktree produces the SHA1 like this Ruby code:
>> require 'digest/sha1'
>> sha1 = ['e69de29bb2d1d6434b8b29ae775ad8c2e48c5391'].pack 'H*'
>> contents = "100644 a.txt\0#{sha1}100644 b.txt\0#{sha1}"
>> data = "tree #{contents.bytesize}\0#{contents}"
>> Digest::SHA1.hexdigest(data)
"2bdf04adb23d2b40b6085efb230856e5e2a775b7"
Now I'm going to modify 'bar/b.txt' and examine the new set of trees:
$ echo hello > bar/b.txt
$ git add bar/b.txt
$ git write-tree
5fa578acc6695bf2af2975ed0ffa7ab448b52c22
$ git cat-file -p 5fa578acc6695bf2af2975ed0ffa7ab448b52c22
040000 tree 9a514e08691a9f636665a43a1c89dc1920dab0fa bar
040000 tree 2bdf04adb23d2b40b6085efb230856e5e2a775b7 foo
Since nothing underneath 'foo' changed, it is stored as the exact same tree. For large structures, this is a huge space win. There is a new tree for 'bar', since I modified it:
$ git cat-file -p 9a514e08691a9f636665a43a1c89dc1920dab0fa
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 a.txt
100644 blob ce013625030ba8dba906f756967f9e9ca394464a b.txt
$ git cat-file -p ce013625030ba8dba906f756967f9e9ca394464a
hello
Again, nothing in the tree objects says anything about revisions or commits. If a tree and its children are unchanged from one commit to the next, they will be represented by the same object. If there are two identical trees in the same commit, they will also be represented by the same object.
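The sharing can be checked without a repository. The following Ruby sketch recomputes the tree identifiers for the layout before and after the change to 'bar/b.txt'. The `tree_id` helper and the variable names are hypothetical, and the entries are assumed to be passed in already sorted by name, as git stores them.

```ruby
require 'digest/sha1'

EMPTY = 'e69de29bb2d1d6434b8b29ae775ad8c2e48c5391'  # empty blob
HELLO = 'ce013625030ba8dba906f756967f9e9ca394464a'  # blob containing "hello\n"

# entries: array of [mode, name, hex_id], already sorted by name.
def tree_id(entries)
  body = entries.map { |mode, name, id| "#{mode} #{name}\0".b + [id].pack('H*') }.join
  Digest::SHA1.hexdigest("tree #{body.bytesize}\0".b + body)
end

# Before the change: both directories hold two empty files.
foo_before  = tree_id([['100644', 'a.txt', EMPTY], ['100644', 'b.txt', EMPTY]])
bar_before  = tree_id([['100644', 'a.txt', EMPTY], ['100644', 'b.txt', EMPTY]])
root_before = tree_id([['40000', 'bar', bar_before], ['40000', 'foo', foo_before]])

# After the change: only bar/b.txt points at a different blob.
foo_after  = tree_id([['100644', 'a.txt', EMPTY], ['100644', 'b.txt', EMPTY]])
bar_after  = tree_id([['100644', 'a.txt', EMPTY], ['100644', 'b.txt', HELLO]])
root_after = tree_id([['40000', 'bar', bar_after], ['40000', 'foo', foo_after]])

puts "foo and bar identical before the change: #{foo_before == bar_before}"
puts "foo unchanged across the change:         #{foo_after == foo_before}"
puts "bar replaced by a new tree:              #{bar_after != bar_before}"
puts "root replaced by a new tree:             #{root_after != root_before}"
```

All four lines print true: the change propagates upward from the modified blob to the root, and everything off that path is reused.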
Regarding the index, there is only a minimal link between it and the trees. One important distinction is that the index stores blob identifiers paired with full paths in a flat list, and does not mention trees at all:
$ git ls-files -s
100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0 bar/a.txt
100644 ce013625030ba8dba906f756967f9e9ca394464a 0 bar/b.txt
100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0 foo/a.txt
100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0 foo/b.txt
When data is copied from a tree to the index, the tree structure is flattened. When data is copied from the index to the trees, the hierarchy is rebuilt from the path names.
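The two directions can be sketched in Ruby with nested hashes standing in for trees and an array of [path, blob id] pairs standing in for the index. Both `flatten_tree` and `rebuild_tree` are hypothetical helpers. Modes and stage numbers are left out, and sorting inside each directory is a simplification of the ordering git uses for the real index.

```ruby
# Tree -> index: walk the nested hash and emit one [path, blob_id] pair per file.
def flatten_tree(tree, prefix = '')
  tree.sort_by { |name, _value| name }.flat_map do |name, value|
    path = prefix + name
    value.is_a?(Hash) ? flatten_tree(value, path + '/') : [[path, value]]
  end
end

# Index -> tree: split each path and recreate the directories it implies.
def rebuild_tree(entries)
  entries.each_with_object({}) do |(path, blob_id), root|
    *dirs, file = path.split('/')
    dir = dirs.inject(root) { |node, name| node[name] ||= {} }
    dir[file] = blob_id
  end
end

EMPTY = 'e69de29bb2d1d6434b8b29ae775ad8c2e48c5391'
HELLO = 'ce013625030ba8dba906f756967f9e9ca394464a'

nested = {
  'bar' => { 'a.txt' => EMPTY, 'b.txt' => HELLO },
  'foo' => { 'a.txt' => EMPTY, 'b.txt' => EMPTY }
}

flat = flatten_tree(nested)
flat.each { |path, blob_id| puts "#{blob_id} #{path}" }
puts rebuild_tree(flat) == nested
```

The flat list has the same four paths as the ls-files output above, and rebuilding it gives back the original nesting.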