I've read and searched, searched and read, rinse, repeat, but a fundamental understanding of trees in Git continues to elude me (beyond the fact that they're loosely analogous to file system directories). They seem to be intrinsically linked to the index, but I just can't get the how through my thick skull.

Blobs are easy, of course, because they're a granular thing. Trees, at least conceptually, feel much more nebulous to me. Is there some way of explaining--in something approaching a remedial manner:

  1. How does Git detect that a tree needs to be created?
  2. What is stored beneath a tree at any given moment?
  3. Is a new tree "revision" created any time a blob beneath that tree is modified?

There may be other questions that I don't even know enough to ask, so feel free to elaborate in any way necessary to facilitate a coherent understanding of the object type and its context.

Much appreciated.

+8  A: 

This can be a first description:

[Image: diagram from Git for Computer Scientists]

But Git From the Bottom Up will have the most detailed description.

the index
Unlike other, similar tools you may have used, Git does not commit changes directly from the working tree into the repository. Instead, changes are first registered in something called the index.
Think of it as a way of “confirming” your changes, one by one, before doing a commit (which records all your approved changes at once).
Some find it helpful to call it the “staging area” instead of the index.

working tree
A working tree is any directory on your filesystem which has a repository associated with it (typically indicated by the presence of a sub-directory within it named .git).
It includes all the files and sub-directories in that directory.

The difference between a Git blob and a filesystem’s file is that a blob stores no metadata about its content. All such information is kept in the tree that holds the blob.

One tree may know those contents as a file named “foo” that was created in August 2004, while another tree may know the same contents as a file named “bar” that was created five years later.
In a normal filesystem, two files with the same contents but with such different metadata would always be represented as two independent files.

Why this difference? Mainly, it’s because a filesystem is designed to support files that change, whereas Git is not.
The fact that data is immutable in the Git repository is what makes all of this work and so a different design was needed.
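
To make the "no metadata in the blob" point concrete, here is a minimal Python sketch of how a blob identifier is derived. It assumes a classic SHA-1 repository and the usual `blob <size>\0<content>` framing; the helper name `blob_id` is made up for illustration.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Return the SHA-1 object name for a blob holding exactly these bytes."""
    # Only the type, the size and the bytes are hashed: no file name, no dates.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Two hypothetical trees may list the same bytes under different names;
# the identifier depends on the bytes alone, so both trees share one blob.
known_as_foo = blob_id(b"hello\n")
known_as_bar = blob_id(b"hello\n")
print(known_as_foo == known_as_bar)
print(blob_id(b""))
```

Because the name never enters the hash, a rename touches only the tree that lists the blob, not the blob itself.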


In short, to quote Git Internals (a very short extract):

A tree is a simple list of trees and blobs that the tree contains, along with the names and modes of those trees and blobs.

More specifically, the content of a tree is:

a very simple text file that lists the:

  • mode,
  • type,
  • sha1 and
  • name

of each entry.

(Jakub Narębski details in the comments:

Actually the tree object is not a text file: for some reason it stores SHA-1 in binary format.

But:

The commit object uses textual format for SHA-1 of parents and of top tree.

)
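
As a rough illustration of that binary layout, the sketch below parses a raw tree body in Python. It assumes SHA-1 (20-byte) identifiers, and `parse_tree` is a hypothetical helper, not a Git API. Note that the raw entry carries no type field; the `blob`/`tree` column printed by `git cat-file -p` is derived from the mode.

```python
def parse_tree(body: bytes):
    """Split a raw tree body into (mode, name, hex_id) tuples.

    Each entry is: ASCII mode, a space, the name, a NUL byte, then the
    20 raw bytes of the SHA-1 (not 40 hex characters).
    """
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)      # end of the mode
        nul = body.index(b"\0", space)     # end of the name
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8")
        raw_id = body[nul + 1:nul + 21]    # binary SHA-1
        entries.append((mode, name, raw_id.hex()))
        pos = nul + 21
    return entries


# Hand-built body for a tree holding one empty file named a.txt.
empty_blob = bytes.fromhex("e69de29bb2d1d6434b8b29ae775ad8c2e48c5391")
body = b"100644 a.txt\0" + empty_blob
print(parse_tree(body))
```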


The OP adds in the comments:

What I think I'm having a hard time comprehending is that every commit has a tree.

It sure has. A commit is a pointer to a *top level tree*, referenced by its SHA1.

And what triggers Git to create a tree initially?

Your first commit (the git init doesn't create a tree, just an empty Git repository)

According to Pro Git, there's a tie-in to the index, but no more information is provided.

You must be referring to the internal objects chapter:

Git normally creates a tree by taking the state of your staging area or index and writing a tree object from it.

So, as soon as you 'git add' some files (i.e. "staging them", or "adding them to the index"), you allow Git to create a tree from the index on your next commit.


This is essentially what Git does when you run the git add and git commit commands

  • it stores blobs for the files that have changed,
  • updates the index,
  • writes out trees,
  • and writes commit objects that reference the top-level trees and the commits that came immediately before them.

These three main Git objects — the blob, the tree, and the commit — are initially stored as separate files in your .git/objects directory.
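
For the curious, "stored as separate files" can be sketched as follows. This assumes the loose-object layout (zlib-compressed `<type> <size>\0<content>`, filed under the first two hex digits of the id) and uses a temporary directory as a stand-in for `.git/objects`; the function names are invented for the example.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose_object(objects_dir: str, obj_type: str, content: bytes) -> str:
    """Store an object the way loose objects are laid out; return its id."""
    data = obj_type.encode("ascii") + b" " + str(len(content)).encode("ascii") + b"\0" + content
    object_id = hashlib.sha1(data).hexdigest()
    directory = os.path.join(objects_dir, object_id[:2])
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, object_id[2:]), "wb") as handle:
        handle.write(zlib.compress(data))
    return object_id


def read_loose_object(objects_dir: str, object_id: str):
    """Return (type, content) for a loose object written as above."""
    path = os.path.join(objects_dir, object_id[:2], object_id[2:])
    with open(path, "rb") as handle:
        data = zlib.decompress(handle.read())
    header, _, content = data.partition(b"\0")
    obj_type, _size = header.decode("ascii").split(" ")
    return obj_type, content


with tempfile.TemporaryDirectory() as objects_dir:  # stand-in for .git/objects
    oid = write_loose_object(objects_dir, "blob", b"hello\n")
    print(oid, read_loose_object(objects_dir, oid))
```

Blobs, trees and commits all share this envelope; only the type word and the body differ.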


VonC
What I get from this is mostly what I already understood. Namely, that a tree is loosely analogous to a directory. What I think I'm having a hard time comprehending is that every commit has a tree. And what triggers Git to create a tree initially? According to Pro Git, there's a tie-in to the index, but no more information is provided. Thanks.
Rob Wilkerson
@Rob: I have updated my answer to better address your questions.
VonC
@VonC: Actually the `tree` object is not a text file: for some reason it stores SHA-1 in binary format (the `commit` object uses textual format for SHA-1 of parents and of top tree).
Jakub Narębski
@Jakub: thank you for the precision. I have included it in the answer.
VonC
So what changed in the 3rd commit? Looks like the tree just references existing blobs. Also, what is the `bak` reference? Lastly, what would happen if the root tree had a subtree/directory containing `level2.txt` (just to extend this beyond the "trivial" use case)?
Rob Wilkerson
@Rob: in the third commit, you add a '`bak`' directory which references a `test.txt` whose content matches commit 1, not commit 2. The new tree simply points to that old version of `test.txt` instead of trying to store a delta against the immediately preceding version. Hence masonk's remark about storing "the entire state of the codebase at each commit".
VonC
Ah, okay. I didn't realize that "bak" was a subtree. I do understand that pointers are used rather than duplicating blobs, so that helps.
Rob Wilkerson
@Rob: if the root tree contains a *new* file (as '`level2.txt`' would be in your case), it depends on the file's content. If that content already exists in the repository, the tree references the blob that is already stored. If not, the tree references a new blob for '`level2.txt`'.
VonC
+1 for patience and visuals.
Rob Wilkerson
+1  A: 

A tree represents the state of files on a disk. It is a timeless, immutable state of things.

A commit does not represent the state of files on disk. The job of commits is to represent the history of states - that is, commits link trees (states) together in chronological order. A single commit represents a moment in time when somebody committed the state of files on a disk to a permanent store. It does so by holding a pointer to a tree ("this is the state that the author committed"), a pointer to a prior commit ("this was the history before the author committed it"), and various metadata necessary to get a good history (timestamps, commit messages, authorship).
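
A small Python sketch may help show how little a commit holds. It assumes the plain textual commit layout (header lines, a blank line, then the message) and ignores multi-line headers such as signatures; `parse_commit` is a hypothetical helper, and the author, timestamps and parent id below are invented.

```python
def parse_commit(body: str) -> dict:
    """Parse the textual body of a commit object into its fields."""
    headers, _, message = body.partition("\n\n")
    commit = {"tree": None, "parents": [], "message": message}
    for line in headers.split("\n"):
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)   # merges have several parents
        else:
            commit[key] = value               # tree, author, committer
    return commit


# Hypothetical commit body: the identity and timestamps are made up, and the
# parent id is a placeholder rather than a real commit.
example = (
    "tree 5fa578acc6695bf2af2975ed0ffa7ab448b52c22\n"
    "parent 0123456789abcdef0123456789abcdef01234567\n"
    "author A U Thor <author@example.com> 1262304000 +0000\n"
    "committer A U Thor <author@example.com> 1262304000 +0000\n"
    "\n"
    "Modify bar/b.txt\n"
)
commit = parse_commit(example)
print(commit["tree"], commit["parents"])
```

Nothing in the commit lists files; the single `tree` line is the only link to the state of the codebase.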


Edit: In reply to the comment, "So is every single commit, then, essentially a snapshot of the complete code base (using pointers where content hasn't changed)?": Every commit holds a pointer to a tree (which is a snapshot of the entire codebase), but really, since we are trying to be precise here, the answer is no: commits don't represent the state of a codebase. They represent a moment in time when a state of the codebase was entered in a permanent history. The tree that a commit points to, however, absolutely does represent the state of the entire codebase (because it is the top level tree - the tree rooted at the root of the repo).

However, for practical purposes, you can think of a commit as both a particular moment in time and a particular state of the codebase. If you ever saw a command that takes a "tree-ish" in the docs, this is what they're talking about: you can give it a tree or a commit, and if you give it a commit, it will just follow that through to the tree it points to. So yeah, in the git documentation, and when you're just using git without thinking about the implementation, you can kind of think of a commit as knowing the entire state of the repo (not just what changed).
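
The "follow the commit through to its tree" step can be sketched with a toy object store. Everything here is illustrative: the ids are made-up labels rather than SHA-1 values, `resolve_tree_ish` is not a Git function, and tags (which Git can also follow) are left out.

```python
# Toy object store keyed by made-up labels instead of real object ids.
OBJECTS = {
    "commit-2": {"type": "commit", "tree": "tree-B", "parents": ["commit-1"]},
    "commit-1": {"type": "commit", "tree": "tree-A", "parents": []},
    "tree-A": {"type": "tree", "entries": {"test.txt": "blob-1"}},
    "tree-B": {"type": "tree", "entries": {"test.txt": "blob-2"}},
    "blob-1": {"type": "blob"},
    "blob-2": {"type": "blob"},
}


def resolve_tree_ish(object_id: str) -> str:
    """Return the tree a tree-ish refers to: itself, or the commit's tree."""
    obj = OBJECTS[object_id]
    if obj["type"] == "tree":
        return object_id
    if obj["type"] == "commit":
        return obj["tree"]          # follow the commit to its top-level tree
    raise ValueError(f"{object_id} is a {obj['type']}, not a tree-ish")


print(resolve_tree_ish("commit-2"), resolve_tree_ish("tree-A"))
```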

Contrary to what you might have read from Joel Spolsky's incorrect blog article, git doesn't store differences. It stores the entire state of the codebase at each commit. It just uses clever tricks with hashing to ensure that there is very little data redundancy in the object store.
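
Here is a simplified Python sketch of that hashing trick, restricted to blobs. It assumes the standard `blob <size>\0<content>` framing with SHA-1; the dictionary-backed `store`, the helper names and the file contents are all made up for the example.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Content-derived identifier, framed the way Git frames a blob."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def store_snapshot(store: dict, files: dict) -> dict:
    """Record every file of a full snapshot; return {path: blob id}."""
    snapshot = {}
    for path, content in files.items():
        oid = blob_id(content)
        store.setdefault(oid, content)   # identical content is stored once
        snapshot[path] = oid
    return snapshot


store = {}
first = store_snapshot(store, {"a.txt": b"one\n", "b.txt": b"two\n", "c.txt": b"two\n"})
second = store_snapshot(store, {"a.txt": b"one\n", "b.txt": b"changed\n", "c.txt": b"two\n"})
# Each snapshot names all three paths, yet only three distinct blobs exist.
print(len(first) + len(second), "paths recorded,", len(store), "blobs stored")
```

Both snapshots are complete, but unchanged content costs nothing extra because its identifier already exists in the store.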

masonk
So is every single commit, then, essentially a snapshot of the complete code base (using pointers where content hasn't changed)?
Rob Wilkerson
I replied in the answer
masonk
I think I'm _starting_ to get this, but I still haven't had that epiphany moment when it just clicks for me. My sense is that your reply (esp. the edit) is pushing me beyond what I've read elsewhere. Thanks.
Rob Wilkerson
+2  A: 

1. How does Git detect that a tree needs to be created?

When you commit, git builds a tree hierarchy for the contents of the index and then builds a commit referencing the root of that tree hierarchy. After the git-add operation, the repository contains blob objects for all of the files added, and the index contains references to the blobs paired with path names. There are no tree objects yet.

When you commit (technically, during the write-tree operation), git recursively constructs a set of trees using the index information. It starts with the trees that contain only blobs, determines their identifiers, and writes the tree objects. Then it goes up each level and constructs the next set of trees, since this cannot happen before the subtree identifiers are known. Then it stores the root-level tree.

A commit operation is broken down into the write-tree and commit-tree steps. The write-tree step uses the current state of the index to identify and (if necessary) store all of the trees. The commit-tree step creates a new commit referencing all of the parent commits and the root tree that was just created.
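
The bottom-up construction described above can be sketched in Python. This is an illustration, not Git's code: it assumes SHA-1 ids, the raw tree entry layout `<mode> <name>\0<20-byte id>`, the raw directory mode `40000`, and Git's rule of ordering a directory as if its name ended in `/`. The flat `index` dictionary and the in-memory `store` are stand-ins invented for the example.

```python
import hashlib


def object_id(obj_type: str, body: bytes) -> str:
    header = obj_type.encode("ascii") + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def write_tree(index: dict, store: dict) -> str:
    """Build trees bottom-up from a flat {path: (mode, blob_id)} index.

    Returns the root tree id; every tree body is saved in `store`.
    """
    entries = {}    # name -> (mode, id) for this directory level
    subdirs = {}    # name -> flat sub-index for one directory below
    for path, (mode, oid) in index.items():
        head, sep, rest = path.partition("/")
        if sep:
            subdirs.setdefault(head, {})[rest] = (mode, oid)
        else:
            entries[head] = (mode, oid)

    # Children first: a parent cannot be hashed until its subtree ids exist.
    for name, sub_index in subdirs.items():
        entries[name] = ("40000", write_tree(sub_index, store))

    def sort_key(name):
        suffix = b"/" if entries[name][0] == "40000" else b""
        return name.encode("utf-8") + suffix

    body = b""
    for name in sorted(entries, key=sort_key):
        mode, oid = entries[name]
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0" + bytes.fromhex(oid)
    tree_id = object_id("tree", body)
    store[tree_id] = body
    return tree_id


EMPTY = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"   # id of the empty blob
index = {p: ("100644", EMPTY) for p in ("bar/a.txt", "bar/b.txt", "foo/a.txt", "foo/b.txt")}
store = {}
root = write_tree(index, store)
print(root, len(store))   # two trees stored: the root, and one subtree shared by foo and bar
```

Keying the store by identifier is what makes identical subtrees collapse into a single object without any explicit comparison.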

2. What is stored beneath a tree at any given moment?

When you learn how to use git, the main focus is on the directed acyclic graph (DAG) of commits: Each commit contains a pointer to the previous commit, and you can go back in time by following these links. This makes sense, since the user interface is about commits, and trees are really secondary.

The trees also form a DAG, but the difference is that they do not represent the history of commits. Just like a blob, once a tree is created, its identifier will forever point to that tree with those contents. If any of the blobs or trees listed in a tree is modified, it gets a new identifier; that changes the content of the containing tree (as does removing an entry), so the containing tree also gets a new identifier in the next commit.

3. Is a new tree "revision" created any time a blob beneath that tree is modified?

Ok, let's say your repository looks like this:

foo/
  a.txt
  b.txt
bar/
  a.txt
  b.txt

and all of the files are empty. Then there are three objects in the repository, not counting the commit:

  1. The top-level tree:

    $ git cat-file -p ebf247ec5ebc97b12cd7a56db330141568edb946
    040000 tree 2bdf04adb23d2b40b6085efb230856e5e2a775b7    bar
    040000 tree 2bdf04adb23d2b40b6085efb230856e5e2a775b7    foo
    
  2. A tree with two blobs:

    $ git cat-file -p 2bdf04adb23d2b40b6085efb230856e5e2a775b7
    100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391    a.txt
    100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391    b.txt
    
  3. The empty blob:

    $ git cat-file -p e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
    

First I'll explain why the trees foo and bar are stored by the same object, then I'll make a change and see what happens.

The SHA1 identifier of a tree is determined entirely by its content, just like a blob. Note that its own name is not involved (the name is recorded in the parent tree), which means that renaming a tree will re-create its parent, but the tree itself will not need to be stored again. If you paste the above output into git mktree, git will respond with the object name of the resulting tree. Under the hood, mktree produces the SHA1 like this Ruby code:

>> require 'digest/sha1'
>> sha1 = ['e69de29bb2d1d6434b8b29ae775ad8c2e48c5391'].pack 'H*'  # 20 raw bytes, not 40 hex characters
>> contents = "100644 a.txt\0#{sha1}100644 b.txt\0#{sha1}"
>> data = "tree #{contents.bytesize}\0#{contents}"  # the header carries the size in bytes
>> Digest::SHA1.hexdigest(data)
=> "2bdf04adb23d2b40b6085efb230856e5e2a775b7"

Now I'm going to modify 'bar/b.txt' and examine the new set of trees:

$ echo hello > bar/b.txt
$ git add bar/b.txt
$ git write-tree
5fa578acc6695bf2af2975ed0ffa7ab448b52c22
$ git cat-file -p 5fa578acc6695bf2af2975ed0ffa7ab448b52c22
040000 tree 9a514e08691a9f636665a43a1c89dc1920dab0fa    bar
040000 tree 2bdf04adb23d2b40b6085efb230856e5e2a775b7    foo

Since nothing underneath 'foo' changed, it is stored as the exact same tree. For large structures, this is a huge space win. There is a new tree for 'bar', since I modified it:

$ git cat-file -p 9a514e08691a9f636665a43a1c89dc1920dab0fa
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391    a.txt
100644 blob ce013625030ba8dba906f756967f9e9ca394464a    b.txt
$ git cat-file -p ce013625030ba8dba906f756967f9e9ca394464a
hello

Again, nothing in the tree objects says anything about revisions or commits. If a tree and its children are unchanged from one commit to the next, they will be represented by the same object. If there are two identical trees in the same commit, they will also be represented by the same object.

Regarding the index, there is only a minimal link between it and the trees. One important distinction is that the index stores blob names and paths, uses a flat list, and does not mention trees at all:

$ git ls-files -s
100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0       bar/a.txt
100644 ce013625030ba8dba906f756967f9e9ca394464a 0       bar/b.txt
100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0       foo/a.txt
100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0       foo/b.txt

When data is copied from a tree to the index, the tree structure is flattened. When data is copied from the index to the trees, it is rebuilt.
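
The flattening direction can be sketched in a few lines of Python. The `trees` dictionary below is a hand-written stand-in for the tree objects shown earlier in this answer (same ids and modes as the `git cat-file -p` listings), and `flatten` is a hypothetical helper; the stage column of `git ls-files -s` is omitted.

```python
EMPTY = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
HELLO = "ce013625030ba8dba906f756967f9e9ca394464a"
ROOT = "5fa578acc6695bf2af2975ed0ffa7ab448b52c22"

# tree id -> {name: (mode, id)}, copied from the listings above.
trees = {
    ROOT: {
        "bar": ("040000", "9a514e08691a9f636665a43a1c89dc1920dab0fa"),
        "foo": ("040000", "2bdf04adb23d2b40b6085efb230856e5e2a775b7"),
    },
    "9a514e08691a9f636665a43a1c89dc1920dab0fa": {
        "a.txt": ("100644", EMPTY),
        "b.txt": ("100644", HELLO),
    },
    "2bdf04adb23d2b40b6085efb230856e5e2a775b7": {
        "a.txt": ("100644", EMPTY),
        "b.txt": ("100644", EMPTY),
    },
}


def flatten(tree_id: str, trees: dict, prefix: str = ""):
    """Return index-style rows (mode, blob_id, full_path), sorted by path."""
    rows = []
    for name, (mode, oid) in trees[tree_id].items():
        if mode == "040000":                       # subtree: recurse with a longer prefix
            rows.extend(flatten(oid, trees, prefix + name + "/"))
        else:                                      # blob: emit one flat row
            rows.append((mode, oid, prefix + name))
    return sorted(rows, key=lambda row: row[2].encode("utf-8"))


for mode, oid, path in flatten(ROOT, trees):
    print(mode, oid, path)
```

The tree ids disappear in the flat rows, which is why the index "does not mention trees at all"; they are recomputed when the hierarchy is written out again.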


jleedev
I've focused on the 3rd answer since it provides an example (+1) and here's my first question: In the first commit, why only 3 objects (not including the commit)? I'd expect the "root" tree, the `foo` tree, the `bar` tree and an empty blob.
Rob Wilkerson
I expanded that a little. Try it out -- paste the tree listing in 2bdf04a into git-mktree, and then believe that foo and bar point to the same tree object.
jleedev