
Let's say I create a repository, add x files to it, and commit. Say the size is a megabyte after the initial commit.

  • Is there any way to estimate how large the repository is going to be in one year's time?

  • If the lines of code have increased by 10%, will the repository have grown accordingly?

  • How do the number of commits, branches, tags, etc. factor into the repository size?

  • Will 10000 commits in the same year make the repository grow (noticeably) more than, say, 1000 commits?

  • Maybe my question is wrongly phrased?

+1  A: 

If you're worried about the size mushrooming, go and clone some online projects and examine the size of their repositories. There are plenty of large projects to choose from, with branches, commits, etc. My experience is that Git and Mercurial are pretty good about keeping size down; the size is more a reflection of the files you put into them (and their size) than of overhead.

Graham Perks
The repository I've created in Mercurial is around 70 MB and spans over 10000 files. I was just a little bit worried that a couple of years down the road we'd be in trouble size-wise, but looking at the other projects linked to here, it doesn't look like we'll be any worse off.
MdaG
+5  A: 

Changes to a Mercurial repository are stored as either a complete file or as a compressed delta against the previous version:

http://mercurial.selenic.com/wiki/FAQ#FAQ.2BAC8-TechnicalDetails.How_does_Mercurial_store_its_data.3F

Mercurial makes the decision about whether to store a complete file versus a delta based on the amount of changes made.

This means that it's not just adding lines of code that will increase the total size of a repository, but also:

  1. The number of changes made to existing code.
  2. The number of changes made to each file per commit.
  3. The number of files that are added and subsequently deleted.

Mercurial retains all deleted files. You could add a 1GB file to your repository and then delete it; the number of lines hasn't increased, but because the file remains in the repository, the repository will be considerably larger.
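
To get a feel for how much this adds up to in an existing repository, you can compare the size of .hg/store (where the revlogs holding the snapshots and compressed deltas live) against the size of the working copy itself. A minimal Python sketch, assuming a local Mercurial clone at a hypothetical path:

    import os

    def dir_size(path, skip=None):
        # Sum the size of every file under path, optionally skipping one subdirectory.
        total = 0
        for root, dirs, files in os.walk(path):
            if skip and skip in dirs:
                dirs.remove(skip)
            for name in files:
                full = os.path.join(root, name)
                if os.path.isfile(full):
                    total += os.path.getsize(full)
        return total

    repo = "/path/to/repo"  # hypothetical path to a local Mercurial clone
    history = dir_size(os.path.join(repo, ".hg", "store"))  # revlogs: snapshots + deltas
    working = dir_size(repo, skip=".hg")                    # just the checked-out files
    print("history: %.1f MB, working copy: %.1f MB" % (history / 1e6, working / 1e6))

The ratio between the two numbers gives a rough indication of how much the accumulated history costs on top of the files you actually work with.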

To answer your questions in turn:

  • I imagine it's feasible to roughly estimate the size of a repository after x months, assuming that you maintain a steady rate of change to the repository overall (i.e. you add/remove/alter files at the same rate, changing roughly the same number of lines per commit).

  • Increasing the number of lines of code by 10% doesn't tell us how many lines were deleted/altered, so an increase in lines of code won't necessarily correspond to the same increase in repo size.

  • Tags don't affect Mercurial repo size by more than a handful of bytes. Nor do branches, until you start working on them, at which point they add the same overhead as working on the tip. Number of commits should be reasonably proportional to the repo size, assuming the same rate of change occurs.

  • Committing 10x as often probably won't increase the file size, as it is the rate of change that is the main influence on repo size, not number of commits.

Ant
The last point is not true, I think. If you commit 10x as often, the total size of the deltas should be roughly the same.
tonfa
And even when it's storing full files they're compressed full files. It's a compressed delta or a compressed full, but it's always compressed.
Ry4an
Thanks for the comments. I've amended the last point.
Ant
+3  A: 

Directly estimating the size in a year is obviously impossible, unless you have some idea of the number of commits and the final size of the work tree.

That said, git is pretty disk-space efficient. It absolutely never stores more than one copy of a given version of a file (this is internally represented as a blob), and older blobs are delta-compressed into packs. This means that it is very efficient at storing plain text, and very inefficient with large binary files. If your project is predominantly plain text, you almost certainly have nothing to worry about.
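
If you want to see how those objects break down in a clone you already have, `git count-objects -v` reports the number of loose objects, the number of objects in packs, and their sizes (the size and size-pack fields are in KiB). A minimal Python sketch that parses its output, assuming it is run from inside a Git work tree:

    import subprocess

    def object_stats(repo_path="."):
        # Parse `git count-objects -v`; size and size-pack are reported in KiB.
        out = subprocess.check_output(["git", "count-objects", "-v"],
                                      cwd=repo_path, universal_newlines=True)
        stats = dict(line.split(": ", 1) for line in out.splitlines() if ": " in line)
        print("loose objects:  %s (%s KiB)" % (stats.get("count", "0"), stats.get("size", "0")))
        print("packed objects: %s (%s KiB)" % (stats.get("in-pack", "0"), stats.get("size-pack", "0")))

    object_stats(".")  # run from inside any clone, e.g. git.git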

Branches and tags have essentially no effect on size. Sure, a branch's reflog could get up to a few KB, but that's nothing to worry about. Lightweight tags are pretty much just a stored SHA1, and annotated tags just add a tiny bit of metadata to that.

As for lines of code and number of commits, it's hard to say exactly. Generally the commits are a much bigger factor than the lines of code; you can have many, many versions of files all adding up (even represented as deltas), but the actual content only has to be stored once. This is backed up by the fact that work trees tend to be much smaller than the .git directory. For example, my clone of git.git has a 17MB work tree and a 39MB .git directory. Other projects I examined had similar ratios.

More commits of equal size would certainly make the repository grow more, but taking 1000 commits and splitting them up into 10000 (encompassing the same changes) wouldn't make the repository much bigger. The commit objects themselves are small; it's the differences in the files that take space. You might see an initial spike in size, as commits are only periodically delta-compressed, but once `git gc --auto` gets triggered, those commits will get compressed back down.

The best generalization I can make is that a repository's .git directory will tend to grow at a rate proportional to the amount of delta per time, which in general should be proportional to work tree size and the rate at which people are modifying the project. This is of course so general as to be completely unhelpful, but there you are.

If you want to estimate, I'd just take some data over the first month or so and try to fit a curve.
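
As a rough illustration of that, here is a minimal Python sketch (the sample data is made up) that fits a straight line through a handful of size measurements taken over the first month and extrapolates it out to a year:

    # (day, size of the .git directory in MB); the numbers below are invented for illustration
    samples = [(0, 1.0), (7, 1.4), (14, 1.9), (21, 2.3), (28, 2.8)]

    n = len(samples)
    mean_x = sum(x for x, _ in samples) / float(n)
    mean_y = sum(y for _, y in samples) / float(n)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in samples)
             / sum((x - mean_x) ** 2 for x, _ in samples))
    intercept = mean_y - slope * mean_x

    print("projected size after one year: %.1f MB" % (intercept + slope * 365))

A linear fit is only as good as the assumption of a steady rate of change, but it is usually enough to tell whether you are heading for tens or hundreds of megabytes in a year.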

Jefromi
good explanation.
meder
Sounds like you need to repack your git.git. Mine is only 36MB after running `git gc`. In my experience, repositories with a smaller history tend to have the .git dir smaller than the worktree after a garbage collection.
Kevin Ballard
@Kevin: Oops. I'm surprised I managed to go that long without triggering a `gc --auto`. Thanks for catching that.
Jefromi
+2  A: 

Take a look at the GitBenchmarks page on the Git wiki, the sections "Repository size benchmarks" and "Other benchmarks and references" (taking into account when each benchmark was made and which versions it used), in particular the entry at the end of the page:

  • DVCS Round-up: One System to Rule Them All? -- Part 3 by Robert Fendt on Linux Developer Network, from 27-01-2009, contains the results of two synthetic benchmarks testing how a system acts under stress (number of commits in the repository, or number of files committed).

    The test system was a VM running Ubuntu 8.10, and the software versions used were SVK 2.0.2 (last is 2.2.3), darcs 2.1.0 (last is 2.4.4), monotone 0.42 (last is 0.48), Bazaar 1.10 (last is 2.2.1), Mercurial 1.1.2 (last is 1.6.4), and Git 1.6.1 (last is 1.7.3).

Jakub Narębski