views: 415
answers: 6
Hi all,

I have inherited a single project in svn: 30 GB in over 300,000 files. There are tons of binary files in there, mostly in an images folder, and operations like updating the entire project can be dramatically slow.

The team has evolved a process of only running update/switch on the specific folders they are working on, and they end up checking in broken code because "it works on my computer". Any one person's working copy can include out-of-date code, switched code, and forgotten-never-committed code. Also, minimal branching takes place.

My personal solution is a small bash checkout/build script that runs at 5 am every morning; however, not everyone has the command-line courage even to copy my solution, and most would rather have the comfort of TortoiseSVN and the broken process.
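For context, the script amounts to little more than the following (the paths and the build command here are placeholders, not our real ones):

    #!/bin/bash
    # Nightly full update + build, kicked off from cron at 05:00.
    # Working-copy path and build step are illustrative placeholders.
    set -e
    WC=/home/builder/project
    LOG=/home/builder/nightly.log
    {
        date
        svn update --non-interactive "$WC"   # bring the whole tree up to date
        cd "$WC" && make clean all           # replace with your real build step
    } >> "$LOG" 2>&1

with a crontab entry along the lines of "0 5 * * * /home/builder/nightly-update.sh" (assuming the script is saved under that name).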

Has anyone tried to tune such a large repository, and can you give advice? Are there any best practices for working with large repositories that I can ease everyone into?

P.S. Externals don't seem to be a good idea, and http://stackoverflow.com/questions/275147/svn-optimizations-to-keep-large-repositories-responsive doesn't apply here because I am dealing with a single project.

P.P.S. This is also currently being looked into: http://www.ibm.com/developerworks/java/library/j-svnbins.html

+2  A: 

To deal with the unwieldy size, I'd consider splitting off binary data into another branch (or even completely removing it to be stored elsewhere), separate from the code. This should at least speed things up, especially if the data doesn't change often.

I understand the need for people to have a central location for their tools, data and libraries, but keeping everything in one dump just doesn't work well.
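If you do go that route, the split itself could be done server-side with svnadmin and svndumpfilter; a rough sketch, with all repository paths invented for illustration:

    # Dump the existing repository:
    svnadmin dump /var/svn/bigrepo > bigrepo.dump

    # Keep only the images tree for the new binaries repository:
    svndumpfilter include trunk/images < bigrepo.dump > images.dump

    # Create the new repository and load the filtered history into it:
    svnadmin create /var/svn/images
    svnadmin load /var/svn/images < images.dump

    # Finally, remove the images tree from the main repository's HEAD:
    svn delete -m "Images moved to their own repository" \
        http://server/svn/bigrepo/trunk/images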

Dana the Sane
+1  A: 

I was an SCM manager in a similar situation. We had a project with over 200K files (mostly code) that was having some of the same issues. Our solution was to split the repository into two versions: a development version and a production version. We seeded the development version with the latest and greatest known working copy of all of the code. The developers started with that and made changes, checked in/out, etc. Once they felt things were stable, an administrator (in our case a build manager) merged the code and did a test build to verify everything worked correctly. If everything passed, it was good. If it didn't, the build administrator would hunt down the developer and punish them severely. We had some of the same issues in the beginning where "It worked on my computer", etc., but before long those were worked out thanks to beatings and floggings.....

At particular points the development code (ALL WORKING CODE!!!!) was merged back into the production version and released to the customer.
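The mechanics were nothing exotic; shown here as two branches of one repository rather than two separate repositories, with invented URLs, it would look roughly like this:

    # Seed the development line from the known-good production code:
    svn copy http://server/svn/proj/production \
             http://server/svn/proj/development \
             -m "Seed development from the last known-good production code"

    # At a release point, the build manager merges development back:
    svn checkout http://server/svn/proj/production prod-wc
    cd prod-wc
    svn merge http://server/svn/proj/development .
    # ...build and test here; commit only if everything passes:
    svn commit -m "Merge verified development code into production"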

Mark
Hi Mark, your answer describes our current setup and a common svn pattern; however, it doesn't really answer my question. Our developers are not using the full working copy because it takes half an hour to update everything.
Talesh
Sorry about not answering the question. This is what we did, though, to solve pretty much the same situation you described; within a few weeks it was rare that we ran into it.
Mark
+3  A: 

We have two repositories, one for our code (which changes frequently) and another for our binary data (very large, changes infrequently). It's a pain sometimes, but the better speed when working with code is worth it.

We also have a Ruby script that we call "daily update", checked into our repository, that we kick off on all of our development PCs via a Windows Scheduled Task, early every morning. It updates both checkouts to the latest version, then builds everything locally, so we're ready to go as soon as we arrive in the morning.

There are some hiccups that we haven't ironed out yet -- for example, when our automated tests run, there's currently a lag between when they check out the code and when they check out the data, so when we commit changes to both repositories, the CI server sometimes gets old code and new data, which causes test failures.
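One possible mitigation would be to have the CI job update both working copies to the same cutoff time instead of two HEADs fetched minutes apart; a sketch, with invented paths:

    #!/bin/bash
    # Pin both checkouts to a single timestamp so the CI server never
    # mixes old code with new data (paths are illustrative only).
    STAMP="{$(date -u +%Y-%m-%dT%H:%MZ)}"
    svn update --non-interactive -r "$STAMP" /ci/checkout/code
    svn update --non-interactive -r "$STAMP" /ci/checkout/data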

When we commit changes to the data repository, we usually just tell everyone else they need to update (we all sit in the same room). Otherwise, we don't usually update the data manually; we just let the daily update script keep it fresh.

Joe White
A: 

Is it possible to break the project into smaller projects that can be connected through some kind of plug-in system?

+5  A: 

Firstly, upgrade to SVN 1.6 on both client and server. The latest release notes mention a speedup for large files (r36389).

Secondly, this may not be too appropriate for you if you have to have the entire project in your working copy, but use sparse directories. We do this for our large repo: the first thing a client does is check out the top-level directory only; then, to get more data, they use the repo browser to go to the desired directory and run "update to this revision" on it. It works wonderfully in TortoiseSVN. 1.6 also has the 'reduce depth' option to remove directories you no longer need to work on.
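The same workflow from the command line, for anyone who prefers it to the repo browser (the URL and directory names are placeholders):

    # Check out only the top level, with no children:
    svn checkout --depth empty http://server/svn/bigrepo/trunk proj
    cd proj

    # Pull in just the directories you actually work on:
    svn update --set-depth infinity src/server
    svn update --set-depth infinity docs

    # 1.6's "reduce depth": drop a directory you no longer need:
    svn update --set-depth exclude docs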

If this isn't for you, you can still do an update on parts of the working copy. Update tends to be slower the more files you have (on Windows, that is; NTFS seems to be particularly poor with the locking strategy used for updating). Bert Huijben noticed this and suggested a fix - it is slated for the 1.7 release, but you could rebuild your current code with his 'quick fix'.

An alternative could be to change your filesystem: if you can reformat, you could try the ext2 IFS driver, but I'm sure you'd be cautious of that!

Last option - turn off your virus scanner for .svn directories, and also for the repository on the server. If you're running Apache on the server, make sure you have keep-alives on for a short time (to prevent re-authentication from occurring). Also turn off indexing on your working copy directories, and shadow copy too (the last doesn't help much, but you may see a better improvement than I did; turning AV off on the server boosted my SVN response 10x, though).

gbjbaanb
Thanks for all the suggestions. I will have to profile them to see which works best.
Talesh
@Talesh - how did you profile? http://stackoverflow.com/questions/2684893/is-there-an-svn-benchmark
ripper234
+1  A: 

I'll keep it brief:

  • Upgrade to the latest version (1.6.x). 1.5.x had speed optimizations as well.
  • Make sure everyone is using the same version of TortoiseSVN, built against the exact version of the server. We had many problems with people updating on a whim and then getting weird problems.
  • Externals work between servers, repositories, and folders in the same repo, so you can move the binaries to another repo/server altogether and just link to them with externals (see the sketch after this list).
  • Restructure the folders so that you can sparse-checkout the project and still be able to work productively. Basically, everyone checks out the top folder + children only, then selectively runs "update to revision" on the folders they need to check out fully.
  • Create scripts that export, build, then commit (or prompt to commit). I have such scripts for my own use. Before committing, I run the script; it exports my wc and then builds. NOTE: This will copy the full wc! So this is useful with sparse checkouts where the data size is small(er).
  • Consider moving the binaries off the repo (I don't recommend this, but it might be the sanest solution to get productivity up again).
  • Remember, exporting doesn't create a wc, which means you save 50% disk space compared to checkouts. So if you restructure such that binaries and infrequently updated items are exported instead of checked out, it would encourage more people to "get the full thing" and not try to skim some of it.
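A hypothetical example of the externals approach from the third bullet, assuming the binaries have already been moved to their own repository (server names and paths are invented):

    # Run inside the checked-out trunk of the code repository:
    svn propset svn:externals \
        "images http://binserver/svn/assets/trunk/images" .
    svn commit -m "Pull images in from the assets repository via an external"

    # Everyone picks up the external on their next update:
    svn update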
Ash