We're pretty happy with SVN right now, but Joel's tutorial intrigued me. So I was wondering - would it be feasible in our situation too?

The thing is - our SVN repository is HUGE. The software itself has a 15-year-old legacy and has already survived several different source control systems. There are over 68,000 revisions (changesets), the source itself takes up over 100MB, and I can't even begin to guess how many GB the whole repository consumes.

The problem then is simple - a clone of the whole repository would probably take ages to make, and would consume far more space on the drive than is remotely sane. And since the very point of distributed version control is to have as many repositories as needed, I'm starting to get doubts.

How does Mercurial (or any other distributed version control) deal with this? Or are they unusable for such huge projects?

Added: To clarify - the whole thing is one monolithic beast of a project which compiles to a single .EXE and cannot be split up.

Added 2: Second thought - The Linux kernel repository uses git and is probably an order of magnitude or two bigger than mine. So how do they make it work?

+1  A: 

You'd split your one huge repository into lots of smaller repositories, one for each module in your old repo. That way people would simply keep, as repositories, whatever SVN projects they would have had before. Not much more space required than before.

Epaga
No, the whole thing is one huge project that compiles to one .EXE. Yes, it's a monolithic beast.
Vilx-
A: 

No, that does not work. You don't want anything that requires significant storage on the client side then. If you get that large (by storing, for example, images etc.), the storage required is more than a normal workstation has anyway, if you want to be efficient.

You'd better go with something centralized then. Simple math - it simply is not feasible to have tons of GB on every workstation AND be efficient there. It simply makes no sense.

TomTom
That's what I was worried about. Then again - I know that Linux kernel development uses Git because other VCSs just didn't scale. I wonder how it is there.
Vilx-
I'd question this - most workstations come with large hard drives. The Linux repo is 800MB; you'd be hard pressed to get bigger than that, and it's peanuts on a newish hard drive.
Paddy
Well, Linux did not scale, but Linus has some REALLY funny requirements, like a very distributed team to start with. In addition, 800MB is not exactly a large archive.
TomTom
@TomTom: That is an 800MB Git repository. It would be much bigger if it were SVN or another centralized VCS. I would be shocked to find a development machine with less than 300GB of space, which should be enough to hold dozens of clones of the full repository. In reality most people will be working with 2 or 3 clones.
mfperzel
@TomTom: even if it *should* take several GB, it would only need to be pulled *once* on each workstation. After that, push/pull are fast and lightweight. Workstations nowadays have hundreds of GB. And even though you'd have plenty of space to keep dozens of *complete* Mercurial/Git repos on one workstation, you typically don't need to. The most complicated workflow I've seen for working on the NetBeans codebase required... two repos! Diffs are lightweight. If you think that Mercurial or Git are slow when working with big repos, it's because you've never tried them. Linux is bigger than the OP's repo, btw...
Webinator
+8  A: 

100MB of source code is less than the Linux kernel. The changelog between Linux kernel 2.6.33 and 2.6.34-rc1 alone has 6,604 commits. Your repository scale doesn't sound intimidating to me.

  • Linux kernel 2.6.34-rc1 uncompressed from .tar.bz2 archive: 445MB
  • Linux kernel 2.6 head checked out from main Linus tree: 827MB

Twice as much, but still peanuts with the big hard drives we all have.

Tadeusz A. Kadłubowski
Indeed, that was my second thought too. So... how does it work? The whole Linux Kernel repository should be an order of magnitude bigger. Do people really download ALL THAT to start hacking?
Vilx-
Yes, the Linux Kernel IS a beast, and it takes a lot less than a gig. The only problem would be the initial conversion, as it would take a long time, but after that it would be a breeze.
MeDiCS
@Vilx: Linux uses Git, which in turn uses compression and diffs for storage. Git is very good at avoiding wasted space.
MeDiCS
Oh, and the total Git history of Linux is almost 200,000 commits. KDE has more than 1,000,000 commits and they are considering migrating to Git too.
Tadeusz A. Kadłubowski
OK, nice. And how fast is the remote clone operation? As fast as just downloading the compressed data, or is there more overhead?
Vilx-
@Tadeusz: to be fair I guess KDE will split their repositories, like gnome did.
tonfa
+2  A: 

Do you need all history? If you only need the last year or two, you could consider leaving the current repository in a read-only state for historical reference. Then create a new repository with only recent history by performing svnadmin dump with the lower bound revision, which forms the basis for your new distributed repository.
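As a rough sketch of that approach (the cutoff revision and the paths below are placeholders, not taken from the question):

```
# Dump only the recent history from the old repository; r60000 is a
# hypothetical cutoff - pick whatever revision marks "recent enough".
svnadmin dump /srv/svn/oldrepo -r 60000:HEAD > recent-history.dump

# Create a fresh repository and load the trimmed history into it.
svnadmin create /srv/svn/newrepo
svnadmin load /srv/svn/newrepo < recent-history.dump
```

The new, smaller repository then becomes the basis for the distributed conversion, while the old one stays read-only for archaeology.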

I do agree with the other answer that a 100MB working copy and 68K revisions aren't that big. Give it a shot.

Si
In the codebase I work on - oh yes, you need all the history (and I don't have it all; the first SVN commit was "Initial code", a big code dump) if you want to be able to tell why a particular line of code is the way it is. It depends on your code churn, of course. I seldom need to see past the most recent delta that affects a line - typically only when only whitespace has changed.
Bernd Jendrissek
+1  A: 

I am using Git on a fairly large C#/.NET project (68 projects in 1 solution), and the TFS footprint of a fresh fetch of the full tree is ~500MB. The Git repo, storing a fair number of commits locally, weighs in at ~800MB. The compaction and the way storage works internally in Git are excellent. It is amazing to see so many changes packed into such a small amount of space.

Leom Burke
+3  A: 

You say you're happy with SVN... so why change?

As far as distributed version control systems go, Linux uses Git and Sun uses Mercurial. Both are impressively large source code repositories, and they work just fine. Yes, you end up with all revisions on all workstations, but that's the price you pay for decentralisation. Remember, storage is cheap - my development laptop currently has 1TB (2x500GB) of hard disk storage on board. Have you tested pulling your SVN repo into something like Git or Mercurial to actually see how much space it would take?
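One cheap way to answer that size question is a throwaway import that you then measure. A minimal sketch, assuming git-svn is installed; the URL and directory name are hypothetical:

```
# Throwaway import of the SVN history into Git (URL and directory are placeholders).
git svn clone http://svn.example.com/repo/trunk repo-git-test

# See how much space the complete converted history actually takes.
du -sh repo-git-test/.git
```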

My question would be - are you ready as an organisation to go decentralised? For a software shop it usually makes much more sense to keep a central repository (regular backups, hook ups to CruiseControl or FishEye, easier to control and administer).

And if you just want something faster or more scalable than SVN, then just buy a commercial product - I've used both Perforce and Rational ClearCase and they scale up to huge projects without any problems.

Of course we're not ready. I don't know if we'll ever be. I'm just curious. :)
Vilx-
+8  A: 

Distributed version control for HUGE projects - is it feasible?

Absolutely! As you know, Linux is massive and uses Git. Mercurial is used for some major projects too, such as Python, Mozilla, OpenSolaris and Java.

We're pretty happy with SVN right now, but Joel's tutorial intrigued me. So I was wondering - would it be feasible in our situation too?

Yes. And if you're happy with Subversion now, you're probably not doing much branching and merging!

The thing is - our SVN repository is HUGE. [...] There are over 68,000 revisions (changesets), the source itself takes up over 100MB

As others have pointed out, that's actually not so big compared to many existing projects.

The problem then is simple - a clone of the whole repository would probably take ages to make, and would consume far more space on the drive than is remotely sane.

Both Git and Mercurial are very efficient at managing the storage, and their repositories take up far less space than the equivalent Subversion repo (having converted a few). And once you have an initial checkout, you're only pushing deltas around, which is very fast. They are both significantly faster in most operations. The initial clone is a one-time cost, so it doesn't really matter how long it takes (and I bet you'd be surprised!).
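To make the "one-time cost" point concrete, here is a sketch of the day-to-day cycle with Mercurial (the repository URL is made up):

```
# One-time cost: clone the full history (URL is a placeholder).
hg clone https://hg.example.com/bigproject
cd bigproject

# Day-to-day: only changesets you don't already have cross the wire.
hg pull -u            # fetch new changesets and update the working copy
# ...edit some files...
hg commit -m "local change"
hg push               # sends only your new changesets
```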

And since the very point of distributed version control is to have as many repositories as needed, I'm starting to get doubts.

Disk space is cheap. Developer productivity matters far more. So what if the repo takes up 1GB? If you can work smarter, it's worth it.

How does Mercurial (or any other distributed version control) deal with this? Or are they unusable for such huge projects?

It is probably worth reading up on how projects using Mercurial such as Mozilla managed the conversion process. Most of these have multiple repos, which each contain major components. Mercurial and Git both have support for nested repositories too. And there are tools to manage the conversion process - Mercurial has built-in support for importing from most other systems.
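For the nested-repositories point: Mercurial models this with subrepositories declared in a `.hgsub` file. A minimal sketch, where the component path and URL are invented for illustration:

```
# Inside the parent repository's working directory: clone the component
# into place, then declare it as a subrepository (path and URL are hypothetical).
hg clone https://hg.example.com/imaging libs/imaging
echo "libs/imaging = https://hg.example.com/imaging" >> .hgsub
hg add .hgsub
hg commit -m "Track libs/imaging as a subrepository"
# Each parent commit now pins the exact subrepo revision in .hgsubstate.
```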

Added: To clarify - the whole thing is one monolithic beast of a project which compiles to a single .EXE and cannot be split up.

That makes it easier, as you only need the one repository.

Added 2: Second thought - The Linux kernel repository uses git and is probably an order of magnitude or two bigger than mine. So how do they make it work?

Git is designed for raw speed. The on-disk format, the wire protocol, the in-memory algorithms are all heavily optimised. And they have developed sophisticated workflows, where patches flow from individual developers, up to subsystem maintainers, up to lieutenants, and eventually up to Linus. One of the best things about DVCS is that they are so flexible they enable all sorts of workflows.

I suggest you read the excellent book on Mercurial by Bryan O'Sullivan, which will get you up to speed fast. Download Mercurial and work through the examples, and play with it in some scratch repos to get a feel for it.

Then fire up the convert command to import your existing source repository. Then try making some local changes, commits, branches, view logs, use the built-in web server, and so on. Then clone it to another box and push around some changes. Time the most common operations, and see how it compares. You can do a complete evaluation at no cost but some of your time.
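A minimal sketch of that evaluation with Mercurial's bundled convert extension (the SVN URL and names below are placeholders):

```
# Enable the bundled convert extension.
cat >> ~/.hgrc <<'EOF'
[extensions]
convert =
EOF

# Import the existing Subversion repository (URL is a placeholder).
hg convert http://svn.example.com/repo repo-hg
cd repo-hg
hg update                      # convert leaves the working copy empty

# Kick the tires: history, a scratch branch, the built-in web server.
hg log --limit 5
hg branch experiment
echo "scratch" > SCRATCH.txt
hg add SCRATCH.txt
hg commit -m "scratch commit on an experimental branch"
hg serve                       # browse the repo at http://localhost:8000/
```

Cloning repo-hg to a second machine and pushing the scratch change back covers the distributed half of the test.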

gavinb
Hmmm... I suppose I could give it a try on my local machine. Heh, the irony! :D
Vilx-
The biggest (open) hg repo I know is NetBeans: http://hg.netbeans.org/main/ (160k revs, working dir is > 100MB, I don't know the exact number). There are a couple of people who have huge converted repos, but they're not public.
tonfa
A: 

From my experience, Mercurial is pretty good at handling a large number of files and a huge history. The drawback is that you shouldn't check in files bigger than 10MB. We used Mercurial to keep a history of our compiled DLL. It's not recommended to put binaries in source control, but we tried it anyway (it was a repository dedicated to the binaries). The repository was about 2 Gig, and we are not too sure that we will be able to keep doing that in the future. Anyway, for source code I don't think you need to worry.

Simon T.
You can put files of any size in a Mercurial repository -- it doesn't care. It's true that it warns you when you add a file bigger than 10 MB. This is because most source files are well below that limit, so adding a larger file may indicate a mistake, like adding a tarball instead of an unpacked directory (`hg add foo.tar.gz` instead of `hg add foo/`). The problem with big files is that they consume bandwidth and disk space when cloning. When merging, they also consume *memory*, perhaps 3 times as much as the size of the file.
Martin Geisler
A: 

Git can obviously work with a project as big as yours since, as you pointed out, the Linux kernel alone is bigger.

The challenge with Mercurial and Git (I don't know whether you have to manage big files) is that they can't handle big files well (so far).

I have experience moving a project of your size (and one that had been around for 15 years too) from CVS/SVN (a mix of the two, actually) into Plastic SCM for distributed and centralized development (the two workflows happening inside the same organization at the same time).

The move will never be seamless, since it's not only a tech problem but also involves a lot of people (a project as big as yours probably involves several hundred developers, doesn't it?), but there are importers to automate the migration, and training can be done very quickly too.

pablo
+1  A: 

Don't worry about repository space requirements. My anecdote: when I converted our codebase from SVN to git (full history - I think), I found that the clone used less space than just the SVN working directory. SVN keeps a pristine copy of all your checked-out files: look at $PWD/.svn/text-base/ in any SVN checkout. With git the entire history takes less space.
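If you want to check that on your own code, a quick comparison along these lines (paths are hypothetical) makes the point concrete:

```
# SVN working copy, including the pristine copies under .svn/text-base:
du -sh /path/to/svn-checkout

# Equivalent Git clone, whose .git directory holds the *entire* history:
du -sh /path/to/git-clone
```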

What really surprised me was how network-efficient git is. I did a git clone of a project at a well-connected place, then took it home on a flash disk, where I keep it up to date with git fetch / git pull, with just my puny little GPRS connection. I wouldn't dare to do the same in an SVN-controlled project.

You really owe it to yourself to at least try it. I think you'll be amazed at just how wrong your centralised-VCS-centric assumptions were.

Bernd Jendrissek