views:

284

answers:

4

I am trying to understand how Git works better.

Given some arbitrary files and an arbitrary number of commits, how does Git decide how to split those files into blobs that are then uniquely identified by SHA-1 hashes?

I just made about 10 commits of Perl/C/Java code and text into a new Git repo, and somehow Git divided the files into little segments. How did it decide how those segments should be divided?

+1  A: 

All files go into a blob, but that doesn't necessarily mean that Git will store one file per blob (Git has a highly efficient packed format that puts stuff together). If you are interested in the internals of Git's packing format, you're better off asking on their list, or reading their architectural documentation.

Edward Z. Yang
OK, I have been reading the docs, but I am trying to speed up the learning process for me and the next person on SO. I haven't been able to answer this question by reading the docs so far. Good advice about asking on the list; I'll do that if nothing comes up here.
Ville M
This is the best I could find: http://eagain.net/articles/git-for-computer-scientists/, but it doesn't really answer the question.
Ville M
The mailing list is really the only resource you should use for this particular question. (Or you could read the source code.)
Arafangion
+3  A: 

I suggest you check out some of the basic (that is, "low level") references. For your particular question, see the section on the Git Object Model in the Git Community Book.

After that, you might be interested in reading Git from the Bottom Up (PDF) or the excellent Git Internals (PDF, US$9) for an understanding of the low-level underpinnings of Git (the "content-addressable file system" and directed acyclic graph relationships).

Pat Notz
A: 

The following things become the blobs:

  • The contents of files
  • The contents of a directory
  • Commit messages
  • Signed tags

This page might help you visualize things:

http://eagain.net/articles/git-for-computer-scientists/

David Plumpton
That's not true. Only the contents of files become blobs. The others become objects (just like a blob is an object).
Pieter
+7  A: 

Git creates a blob for the content of each file, unless that same content already exists (in which case it reuses the blob). But there's more -- Git also creates objects for every directory (a tree), commit, and annotated tag (signed tags included). Every object is stored in .git/objects until the repository is repacked (automatically or by running git gc), at which point some of the objects will be put together and deltified into a packfile (in .git/objects/pack).

It does not split the contents of a single file among multiple blobs, or little segments, as you seem to think.
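
To make the "one blob per distinct content" point concrete, here is a minimal Python sketch of how a blob's ID is derived. It assumes a repository using the default SHA-1 object format; the function names git_blob_id and loose_object_path are made up for this illustration and are not part of Git. The ID is the SHA-1 of a small header ("blob", the content length in bytes, a NUL byte) followed by the whole file content, so the filename plays no part and identical content always maps to the same blob.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the ID a SHA-1 Git repository gives a blob with this content."""
    # Header is "blob <size in bytes>\0", followed by the raw file content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Where an unpacked ("loose") object with this ID lives in the repo."""
    # The first two hex digits name the directory, the rest name the file.
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


first = git_blob_id(b"hello world\n")
second = git_blob_id(b"hello world\n")    # same content, e.g. in another file
changed = git_blob_id(b"hello world!\n")  # one byte differs: a different blob

print(first)                     # same value as: echo "hello world" | git hash-object --stdin
print(first == second)           # identical content is stored only once
print(first == changed)          # any change yields a whole new blob
print(loose_object_path(first))
```

Note that the whole file is hashed in one go; nothing in this step splits a file into segments.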

Pieter
OK, thanks for the first part, it helps a lot. On the last point, I think what is confusing me is that when browsing a particular file with the gitk file viewer, Git seems to know which commit particular parts of a newly combined file came from; that's where I got the "segments" idea. How does Git determine where those segments came from, and how does it know that, for example, some often-repeated line like "make" is part of a unique segment and not a repeated change in its own right?
Ville M
I'm not sure what you mean. If you mean the differences it shows compared to the last revision, that's called a 'diff' and is calculated on the fly by comparing the two files. If you mean the blame view in git gui, that's done by some clever blame algorithm, see 'git blame' on the command line. It works roughly the same as a diff, but is done for each revision and also takes removed lines from other files into consideration.
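
As a rough illustration of "calculated on the fly", the sketch below diffs two complete versions of a file using Python's standard difflib module. This is only a stand-in: Git has its own diff implementation, and the Makefile contents here are invented for the example. The point is that the changed lines are derived by comparing two full snapshots at view time; they are not stored anywhere as separate segments.

```python
import difflib

# Two full snapshots of the same (hypothetical) file, as two blobs would hold them.
old_version = ["all:\n", "\tmake build\n", "\tmake test\n"]
new_version = ["all:\n", "\tmake build\n", "\tmake lint\n", "\tmake test\n"]

# The diff is computed from the two snapshots when it is requested.
diff_lines = list(
    difflib.unified_diff(
        old_version, new_version, fromfile="a/Makefile", tofile="b/Makefile"
    )
)

# Keep only real additions and removals, skipping the +++/--- file headers.
added = [l for l in diff_lines if l.startswith("+") and not l.startswith("+++")]
removed = [l for l in diff_lines if l.startswith("-") and not l.startswith("---")]

print("".join(diff_lines))
print("added:", added)
print("removed:", removed)
```

A blame view repeats this kind of comparison across the file's history to attribute each surviving line to the commit that introduced it.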
Pieter
OK, thanks, I think I understand now. What confused me, coming from other inferior SCMs (SVN/CVS/Perforce), was that they typically could not automatically diff against older revisions that had existed in differently named files in different directories unless branching had been done explicitly, which I had not done with Git in this case. So I understand now that those are two separate issues: how the "clever" diff/blame algorithm works and how code is stored in blobs. I am marking yours as the answer; feel free to add detail if something else comes to mind for us Git newbies... Thanks
Ville M