Not a complete answer, but these comments (from AlBlue) might help with the space-management aspect of the question:
There are a couple of things worth clarifying here.
Firstly, it is possible to have a bigger Git repository than an SVN repository; I hope I didn't imply that that was never the case. However, in practice, a Git repository generally takes less space on disk than an equivalent SVN repository would.
One thing you cite is Apache's single SVN repository, which is obviously massive. However, one only has to look at git.apache.org, and you'll note that each Apache project has its own Git repository. What's really needed is a comparison of like-for-like; in other words, a checkout of the (abdera) SVN project vs the clone of the (abdera) Git repository.
I was able to clone git://git.apache.org/abdera.git. On disk, it consumed 28.8MB.
I then checked out the SVN version, http://svn.apache.org/repos/asf/abdera/java/trunk/, and it consumed 34.3MB.
Both numbers were measured on a separately mounted RAM-disk partition, and the figure quoted is the number of bytes consumed on that disk.
If using du -sh as a means of testing, the Git checkout was 11MB and the SVN checkout was 17MB.
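For anyone who wants to repeat this kind of comparison on their own checkouts, here is a minimal sketch in Python. The function name tree_usage is made up for illustration, and the numbers it prints will not match du exactly (du also counts the directories themselves), so treat it as a rough cross-check rather than the method used above.

```python
import os
import sys


def tree_usage(root):
    """Return (file_count, apparent_bytes, allocated_bytes) for a directory tree."""
    file_count = 0
    apparent = 0
    allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # lstat so that symlinks are measured themselves, not their targets
            st = os.lstat(os.path.join(dirpath, name))
            file_count += 1
            apparent += st.st_size
            # On POSIX, st_blocks counts 512-byte units actually allocated.
            # Where the platform does not report it, fall back to st_size.
            blocks = getattr(st, "st_blocks", None)
            allocated += blocks * 512 if blocks is not None else st.st_size
    return file_count, apparent, allocated


if __name__ == "__main__":
    # Usage: python tree_usage.py path/to/git-clone path/to/svn-checkout
    for path in sys.argv[1:]:
        count, apparent, allocated = tree_usage(path)
        print(f"{path}: {count} files, {apparent} bytes apparent, "
              f"{allocated} bytes allocated")
```

The gap between the apparent and the allocated figure tends to widen as the number of small files grows, since each file occupies at least one filesystem block.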
The Git version of Apache Abdera would let me work with any version of the history up to and including the current release; the SVN checkout only holds a pristine copy of the currently checked-out version. Yet the Git clone takes less space on disk.
How, you may ask?
Well, for one thing, SVN creates a lot more files. The SVN checkout has 2959 files; the corresponding Git repository has 845 files.
Secondly, whilst SVN has an .svn folder at each level of the hierarchy, a Git repo only has a single .git directory at the top level. This means (amongst other things) that renames from one dir to another have relatively smaller impact in Git than in SVN, which, admittedly, already has relatively small impact anyway.
Thirdly, Git stores its data as compressed objects, whereas SVN stores them as uncompressed copies. Go into any .svn/text-base directory, and you'll find uncompressed copies of the (base) files.
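To make the "compressed objects" point concrete, here is a small Python sketch of how a loose Git blob is built in the classic SHA-1 object format: a "blob <size>" header, a NUL byte, the file content, all deflated with zlib. The function name git_blob and the sample text are made up for illustration; this only mimics the storage format and is not how Git itself is implemented.

```python
import hashlib
import zlib


def git_blob(content: bytes):
    """Return (object_id, compressed_bytes) for a loose blob holding content."""
    # Git prefixes the content with "blob <length>\0" before hashing it.
    store = b"blob %d\x00" % len(content) + content
    object_id = hashlib.sha1(store).hexdigest()
    # The same header-plus-content bytes are what gets zlib-compressed on disk.
    return object_id, zlib.compress(store)


if __name__ == "__main__":
    # Hypothetical, repetitive source text standing in for a real file.
    text = b"public class Example {}\n" * 200
    oid, compressed = git_blob(text)
    print(oid, len(text), "bytes raw ->", len(compressed), "bytes compressed")
```

An uncompressed base copy, as found in .svn/text-base, would cost the full raw size for the same file.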
Git has a mechanism to compress all files (and indeed, all history) into pack files. In Abdera's case, .git/objects/pack/ has a single 4.8MB .pack file containing all history.
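If you want to check the pack size of one of your own clones, a sketch along these lines should do. It assumes an ordinary non-bare clone where .git is a directory, and the function name pack_sizes is made up for illustration.

```python
from pathlib import Path


def pack_sizes(repo_root):
    """Map each .pack file name under .git/objects/pack to its size in bytes."""
    pack_dir = Path(repo_root) / ".git" / "objects" / "pack"
    if not pack_dir.is_dir():
        # Not a regular clone, or nothing has been packed yet.
        return {}
    return {p.name: p.stat().st_size for p in sorted(pack_dir.glob("*.pack"))}


if __name__ == "__main__":
    import sys

    # Usage: python pack_sizes.py path/to/clone
    for name, size in pack_sizes(sys.argv[1]).items():
        print(f"{name}: {size / 1e6:.1f} MB")
```

A fresh clone usually arrives as a single pack; a repository that has been worked in for a while may have several packs plus loose objects until git gc repacks them.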
So the repository is (roughly) the same size as the currently checked-out code in this case, though I wouldn't expect that always to be the case.
Anyway, you're right that history can grow to be more than the total size of the current checkout; but because of the way that SVN works, it really has to approach twice the size in order to make much of a difference. Even then, disk space reduction is not really the main reason to use a DVCS anyway; it's an advantage for some things, sure, but it's not the real reason why people use it.
Note that Git (and Hg, and other DVCSs) do suffer from a problem where (large) binaries are checked in, then deleted, as they'll still show up in the repository and take up space, even if they're not current. The text compression takes care of these kinds of things for text files, but binary ones are more of an issue. (There are administrative commands that can update the contents of Git repositories, but they have slightly higher overhead/administrative cost than CVS; git filter-branch is like svnadmin dump/filter/load.)
As for the speed aspect, I mentioned it in my "How fast is git over subversion with remote operations?" answer (as Linus said in his Google presentation, paraphrasing here: "anything involving the network will just kill the performance").
And the GitBenchmark document mentioned by Jakub Narębski is a good addition, even though it doesn't deal directly with Subversion.
It does list the kinds of operations you need to monitor on a DVCS performance-wise.
Other Git benchmarks are mentioned in this SO question.