How does GIT decide what goes into a blob?

+1 Q:

How does GIT decide what goes into a blob?

I am trying to understand how Git works better.

Given some arbitrary files and some arbitrary number of commits, how does Git decide how to split those files into blobs, which are then uniquely identified by SHA-1 hashes?

I just did about 10 commits of Perl/C/Java code and text into a new Git repo, and somehow Git divided the files into little segments. How did it decide how those segments should be divided?

+1 A:

All files go into blobs, but that doesn't necessarily mean that Git will store one file per blob on disk (Git has a highly efficient packed format that puts things together). If you are interested in the internals of Git's packing format, you're better off asking on the Git mailing list or reading the architectural documentation.

Edward Z. Yang 2009-04-23 00:55:25

OK, I have been reading the docs, but I am trying to speed up the learning process for me and the next person on SO. I haven't been able to answer this question by reading the docs so far. Good advice about asking on the list; I'll do that if nothing comes up here.

Ville M 2009-04-23 01:02:20

This is the best I could find: http://eagain.net/articles/git-for-computer-scientists/, but it doesn't really answer the question.

Ville M 2009-04-23 01:04:32

The mailing list is really the only resource you should use for this particular question. (Or you could read the source code.)

Arafangion 2009-04-23 01:12:07

+3 A:

I suggest you check out some of the basic (that is, "low-level") references. For your particular question, see the section on the Git Object Model in the Git Community Book.

After that, you might be interested in reading Git from the Bottom Up (PDF) or the excellent Git Internals (PDF, US$9) for an understanding of the low-level underpinnings of Git (the "content-addressable file system" and directed acyclic graph relationships).

Pat Notz 2009-04-23 02:39:55

The following things become the blobs:

The contents of files
The contents of a directory
Commit messages
Signed tags

This page might help you visualize things:

http://eagain.net/articles/git-for-computer-scientists/

David Plumpton 2009-04-23 02:54:14

That's not true. Only the contents of files become blobs. The others become objects (just like a blob is an object).

Pieter 2009-04-23 02:56:12

+7 A:

Git creates a blob for the content of each file, unless that same content already exists (in which case it reuses the blob). But there's more -- Git also creates objects for every directory, commit, and annotated or signed tag. Every object is stored in .git/objects until the repository is repacked (automatically or by running git gc), at which point some of the objects are put together and deltified into a packfile (in .git/objects/pack).

It does not split the contents of a single file among multiple blobs, or little segments, as you seem to think.
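
To make the "one blob per distinct file content" point concrete, here is a minimal Python sketch of how a blob ID is derived. Git hashes a short header (the word `blob`, the content length in bytes, and a NUL byte) followed by the entire file content. The function names `git_blob_id` and `loose_object_path` are made up for this illustration and are not part of Git, and the sketch assumes a repository using SHA-1 object IDs.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 ID Git would assign to a blob holding `content`."""
    # Header: object type, a space, the byte length in decimal, then a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Path of a loose (not yet packed) object, relative to the repository root."""
    # The first two hex digits name the directory, the remaining 38 the file.
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


if __name__ == "__main__":
    a = git_blob_id(b"hello world\n")
    b = git_blob_id(b"hello world\n")   # identical content from another file
    c = git_blob_id(b"hello world!\n")  # one byte added
    print(a == b)  # True: identical content maps to the same blob
    print(a == c)  # False: any change to the file yields a different blob
    print(loose_object_path(a))
```

The ID depends only on the whole content, not on the file name or on any sub-segment of the file, which is why two identical files share one blob. The result can be compared against `git hash-object <file>` for the same bytes.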

Pieter 2009-04-23 03:00:03

OK, thanks for the first part, it helps a lot. On the last point, I think what is confusing me is that when browsing a particular file with the gitk file viewer, Git seems to know which commit particular parts of a newly combined file came from. That's where I got the "segments" idea. How does Git determine where those segments came from, and how does it know that, for example, some often-repeated line like "make" is part of a unique segment and not a repeated change in its own right?

Ville M 2009-04-23 17:15:37

I'm not sure what you mean. If you mean the differences it shows compared to the last revision, that's called a 'diff' and is calculated on the fly by comparing the two files. If you mean the blame view in git gui, that's done by some clever blame algorithm, see 'git blame' on the command line. It works roughly the same as a diff, but is done for each revision and also takes removed lines from other files into consideration.
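
The idea of computing this on the fly from whole-file snapshots can be sketched in Python. This is a deliberately simplified, hypothetical `blame` function, not Git's algorithm: it uses the standard library's `difflib` as a stand-in for Git's own diff, walks the history oldest-first (Git works backward from the newest commit), and handles only a single file, so it does not follow lines moved or copied between files the way `git blame -M` / `-C` can. The revision labels and file contents are invented.

```python
import difflib


def blame(revisions):
    """Attribute each line of the newest revision to the revision that added it.

    revisions: list of (label, lines) pairs, oldest first.
    Returns a list of (label, line) pairs for the newest revision.
    """
    attributed = []  # (label, line) pairs for the revision processed so far
    for label, lines in revisions:
        old_lines = [line for _, line in attributed]
        matcher = difflib.SequenceMatcher(a=old_lines, b=lines, autojunk=False)
        updated = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged lines keep the label of the revision that added them.
                updated.extend(attributed[i1:i2])
            elif tag in ("insert", "replace"):
                # New or rewritten lines are credited to the current revision.
                updated.extend((label, line) for line in lines[j1:j2])
            # "delete" opcodes contribute nothing to the new revision.
        attributed = updated
    return attributed


if __name__ == "__main__":
    history = [
        ("c1", ["all:\n", "\tmake\n"]),
        ("c2", ["all:\n", "\tmake\n", "test:\n", "\tmake\n"]),
    ]
    for label, line in blame(history):
        print(label, repr(line))
```

In this example the first `make` line stays attributed to `c1` and the second, identical `make` line is attributed to `c2`. Lines are matched by their position in the aligned sequences, not by their text alone, so a frequently repeated line is not mistaken for an older one. Nothing here is stored as "segments"; it is all derived from the full snapshots when asked.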

Pieter 2009-04-24 18:53:48

OK, thanks, I think I understand now. What confused me, coming from other inferior SCMs (SVN/CVS/Perforce), was that they typically could not automatically diff against older revisions that had existed in differently named files in different directories unless branching had been done explicitly, which I had not done with Git in this case. So I understand now that those are two separate issues: how the "clever" diff/blame algorithm works and how code is stored in blobs. I am marking yours as the answer; feel free to add detail if something else comes to mind for us Git newbies... Thanks

Ville M 2009-04-27 17:26:54
