views:

229

answers:

2

I just saw the first git tutorial at http://blip.tv/play/Aeu2CAI

How does git store all the versions of all the files and still be more economical in space than subversion which saves only the latest version of the code?

I know this can be done using compression, but that would come at the cost of speed. Yet this also says that git is much faster (though where it gains the most is the fact that most of its operations are offline).

So, my guess is that

  • git compresses data extensively
  • it is still faster because decompression + work is still faster than network_fetch + work

Am I correct? Even close?

+4  A: 

Not a complete answer, but those comments (from AlBlue) might help on the space management aspect of the question:

There are a couple of things worth clarifying here.

Firstly, it is possible to have a bigger Git repository than an SVN repository; I hope I didn't imply that that was never the case. However, in practice, it generally tends to be the case that a Git repository takes less space on disk than an equivalent SVN repository would.
One thing you cite is Apache's single SVN repository, which is obviously massive. However, one only has to look at git.apache.org, and you'll note that each Apache project has its own Git repository. What's really needed is a comparison of like-for-like; in other words, a checkout of the (abdera) SVN project vs the clone of the (abdera) Git repository.

I was able to check out git://git.apache.org/abdera.git. On disk, it consumed 28.8 MB.
I then checked out the SVN version http://svn.apache.org/repos/asf/abdera/java/trunk/, and it consumed 34.3 MB.
Both numbers were taken from a separately mounted partition in RAM space, and the number quoted was the number of bytes taken from the disk.
Using du -sh as a means of testing, the Git checkout was 11 MB and the SVN checkout was 17 MB.

The Git version of Apache Abdera would let me work with any version of the history up to and including the current release; the SVN checkout would only have the backup (pristine copy) of the currently checked out version. Yet it takes less space on disk.

How, you may ask?

Well, for one thing, SVN creates a lot more files. The SVN checkout has 2959 files; the corresponding Git repository has 845 files.

Secondly, whilst SVN has an .svn folder at each level of the hierarchy, a Git repo has only a single .git directory at the top level. This means (amongst other things) that renames from one directory to another have a relatively smaller impact in Git than in SVN, though admittedly the impact is already relatively small anyway.

Thirdly, Git stores its data as compressed objects, whereas SVN stores them as uncompressed copies. Go into any .svn/text-base directory, and you'll find uncompressed copies of the (base) files.
Git has a mechanism to compress all files (and indeed, all history) into pack files. In Abdera's case, .git/objects/pack/ has a single .pack file (containing all history) in a 4.8 MB file.
So the size of the repository is (roughly) the same size as the current checked out code in this case, though I wouldn't expect that always to be the case.
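For anyone who wants to see the packing in action, here is a minimal sketch using a throwaway repository (file names and contents are hypothetical; git gc triggers the repacking):

```shell
# Build a throwaway repository with two commits, then pack it.
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
git config user.email you@example.com
git config user.name "You"
seq 1 1000 > data.txt
git add data.txt && git commit -qm "v1"
seq 1 1010 > data.txt
git commit -qam "v2"
git gc --quiet                 # repack loose objects into a pack file
ls .git/objects/pack/          # a single .pack file plus its .idx index
git count-objects -v           # "in-pack" now accounts for the objects
```

The .idx file sitting next to the .pack is the generated index that lets Git locate an object inside the pack without scanning the whole file.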

Anyway, you're right that history can grow to be more than the total size of the current checkout; but because of the way that SVN works, it really has to approach twice the size in order to make much of a difference. Even then, disk space reduction is not really the main reason to use a DVCS anyway; it's an advantage for some things, sure, but it's not the real reason why people use it.

Note that Git (and Hg, and other DVCSs) do suffer from a problem where (large) binaries are checked in, then deleted: they still show up in the repository and take up space, even though they are no longer current. Text compression takes care of these kinds of things for text files, but binaries are more of an issue. (There are administrative commands that can rewrite the contents of Git repositories, but they carry a slightly higher overhead/administrative cost; git filter-branch is comparable to svnadmin dump/filter/load.)
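As a concrete (hedged) sketch of such an administrative command: the following throwaway-repository session purges a deleted binary (a hypothetical big.bin) from history with git filter-branch. On recent Git versions, the separately installed git filter-repo is the recommended replacement:

```shell
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
git config user.email you@example.com
git config user.name "You"
head -c 100000 /dev/zero > big.bin              # the "large" binary
git add big.bin && git commit -qm "add binary"
git rm -q big.bin && git commit -qm "delete binary"
git rev-list --objects --all | grep -c big.bin   # still 1: the blob lives on
# Rewrite every commit to drop the file, then discard the backup refs:
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f \
  --index-filter 'git rm --cached --ignore-unmatch -q big.bin' -- --all
for ref in $(git for-each-ref --format='%(refname)' refs/original); do
  git update-ref -d "$ref"
done
git reflog expire --expire=now --all
git gc --prune=now --quiet
git rev-list --objects --all | grep -c big.bin || true   # now 0: blob is gone
```

The reflog expiry and the pruning gc are what actually reclaim the space; filter-branch alone only rewrites the refs.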


As for the speed aspect, I mentioned it in my "How fast is git over subversion with remote operations?" answer (as Linus said in his Google presentation, paraphrasing here: "anything involving the network will just kill the performance").

And the GitBenchmark document mentioned by Jakub Narębski is a good addition, even though it doesn't deal directly with Subversion.
It does list the kind of operation you need to monitor on a DVCS performance-wise.

Other Git benchmarks are mentioned in this SO question.

VonC
@VonC:: thanks!
Lazer
+5  A: 

I guess that you wanted to ask how it is possible for git clone (full repository + checkout) to be smaller than checked-out sources in Subversion. Or did you mean something else?

This question is answered in the comments


Repository size

First you should take into account that, alongside the checkout (working version), Subversion stores a pristine copy (the last version) in those .svn subdirectories. The pristine copy is stored uncompressed in Subversion.

Second, git uses the following techniques to make repository smaller:

  • each version of a file's contents is stored only once; this means that if you have only two different versions of some file across e.g. 10 revisions (10 commits), git stores only those two versions, not 10.
  • objects (and deltas, see below) are stored compressed; the text files used in programming compress really well (to around 60% of original size, i.e. a 40% reduction in size from compression)
  • after repacking, objects are stored in deltified form, as a difference from some other version; additionally, git tries to order delta chains in such a way that deltas consist mainly of deletions (in the usual case of growing files, this is recency order); also, IIRC, the deltas are compressed as well.
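The first point above is easy to verify in a throwaway repository (hypothetical file name): ten commits that alternate between just two versions of a file produce exactly two blobs, not ten.

```shell
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
git config user.email you@example.com
git config user.name "You"
echo "version A" > file.txt
git add file.txt
for i in 1 2 3 4 5; do
  echo "version A" > file.txt && git commit -qam "A $i"
  echo "version B" > file.txt && git commit -qam "B $i"
done
git rev-list --all | wc -l                        # 10 commits
git rev-list --objects --all | grep -c file.txt   # only 2 blobs for file.txt
```

git rev-list --objects lists every reachable object exactly once, so counting the lines whose path is file.txt counts the distinct blobs stored for that file.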

Performance (speed of operations)

First, any operation that involves the network will be much slower than a local operation. Therefore, for example, comparing the current state of the working area with some other version, or getting a log (a history), which in Subversion involves a network connection and network transfer but in Git is a local operation, would of course be much slower in Subversion than in Git. BTW, this is the difference between centralized version control systems (using a client-server workflow) and distributed version control systems (using a peer-to-peer workflow), not only between Subversion and Git.

Second, if I understand it correctly, nowadays the limitation is not CPU but IO (disk access). Therefore it is possible that the gain from having to read less data from disk thanks to compression (and being able to mmap it into memory) outweighs the loss from having to decompress that data.

Third, Git was designed with performance in mind (see e.g. GitHistory page on Git Wiki):

  • The index stores stat information for files, and Git uses it to decide, without examining the files, whether they were modified (see e.g. the core.trustctime config variable).
  • The maximum delta depth is limited to pack.depth, which defaults to 50. Git has delta cache to speed up access. There is (generated) packfile index for fast access to objects in packfile.
  • Git takes care not to touch files it doesn't have to. For example, when switching branches, or rewinding to another version, Git updates only the files that changed. A consequence of this philosophy is that Git supports only very minimal keyword expansion (at least out of the box).
  • Git uses its own version of LibXDiff library, nowadays also for diff and merge, instead of calling external diff / external merge tool.
  • Git tries to minimize latency, which means good perceived performance. For example it outputs first page of "git log" as fast as possible, and you see it almost immediately, even if generating full history would take more time; it doesn't wait for full history to be generated before displaying it.
  • When fetching new changes, Git checks which objects you have in common with the server, and sends only the (compressed) differences in the form of a thin packfile. Admittedly, Subversion can (or perhaps by default does) also send only differences when updating.
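A brief sketch of the tuning knobs mentioned above (the values shown are illustrative; pack.depth already defaults to 50 in stock Git):

```shell
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
# Cap the delta chain length used when repacking (default: 50).
git config pack.depth 50
# Tell Git not to trust ctime in the index's stat data, useful when
# other tools (e.g. backup software) change ctime behind Git's back.
git config core.trustctime false
git config pack.depth           # prints: 50
git config core.trustctime      # prints: false
```

A deeper pack.depth can shrink the pack at the cost of longer delta chains to walk on access, which is exactly the compression-versus-speed trade-off the question asks about.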

I am not a Git hacker, and I probably missed some techniques and tricks that Git uses for better performance. Note however that Git makes heavy use of POSIX features (like memory-mapped files) for this, so the gain might not be as large on MS Windows.

Jakub Narębski
@Jakub Narębski: yes, that is what I wanted to know, and also how git can still be faster even after using compression: for each operation it has to decompress and then proceed, a step where subversion should be quicker because it does not need to decompress first?
Lazer
@Jakub: completely missed your answer. Excellent. +1
VonC
@Jakub Narębski: thanks for the details. Specially for mentioning that "nowadays the limitation is not CPU but IO (disk access)". I should have thought of that.
Lazer
While answering to http://stackoverflow.com/questions/3224059/why-cant-i-find-good-references-request-for-paper-version-control-system-git, I just realize that Git Wiki (http://git.wiki.kernel.org/index.php/Main_Page) is a broken link. Your GitBenchmark link is also unavailable. Strange...
VonC
I hope it is some temporary glitch with Git Wiki.
Jakub Narębski
@VonC: Git Wiki is up now (problem with upgrade).
Jakub Narębski