Hi there. I am looking for opinions on how to handle large binary files on which my source code (a web application) depends. We are currently discussing several alternatives:

  1. Copy the binary files by hand.
    • Pro: Not sure.
    • Contra: I am strongly against this; it increases the likelihood of errors when setting up a new site or migrating the old one, and it adds another hurdle to clear.
  2. Manage them all with git.
    • Pro: Removes the possibility of 'forgetting' to copy an important file.
    • Contra: Bloats the repository, reduces the flexibility of managing the code base, and makes checkouts/clones/etc. take quite a while.
  3. Separate repositories.
    • Pro: Checking out/cloning the source code stays as fast as ever, and the images are properly archived in their own repository.
    • Contra: Gives up the simplicity of having one and only one git repository for the project, and surely introduces other issues I haven't thought of.

What are your experiences/thoughts regarding this?

Also: Does anybody have experience with multiple git repositories and managing them in one project?

Update: The files are images for a program which generates PDFs that embed those files. The files will not change very often (as in years), but they are essential to the program: it will not work without them.

Update2: I found a really nice screencast on using git-submodule at GitCasts.

+4  A: 

In my opinion, if you're likely to modify those large files often, or if you intend to do a lot of git clones or git checkouts, then you should seriously consider using another git repository (or maybe another way to access those files).

But if you work like we do, and if your binary files are not modified often, then the first clone/checkout will be slow, but after that it should be as fast as you want (assuming your users keep working in the repository they first cloned).

claferri
And separate repos won't make the checkout time any shorter, since you still have to check out both repos!
Emil
+16  A: 

If the program won't work without the files, it seems like splitting them into a separate repo is a bad idea. We have large test suites that we break into a separate repo, but those are truly "auxiliary" files.

However, you may be able to manage the files in a separate repo and then use git-submodule to pull them into your project in a sane way. So, you'd still have the full history of all your source but, as I understand it, you'd only have the one relevant revision of your images submodule. The git-submodule facility should help you keep the correct version of the code in line with the correct version of the images.

Here's a good introduction to submodules in the Git Book.
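
A minimal sketch of what that could look like (the repository URL and the images/ path here are just placeholders, not anything specific to your project):

    # In the superproject, attach the images repository as a submodule
    git submodule add git://example.com/project-images.git images
    git commit -m "Add images repository as a submodule"

    # Anyone cloning the project later pulls in the pinned revision of the images
    git clone git://example.com/project.git
    cd project
    git submodule init      # copy the submodule URL(s) into .git/config
    git submodule update    # check out the exact commit the superproject records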

Pat Notz
This is interesting. I'll look into that.
pi
+6  A: 

I would use submodules (as Pat Notz suggests) or two distinct repositories. If you end up modifying your binary files too often, then I would try to minimize the impact of the huge repository by cleaning its history:

I had a very similar problem several months ago: ~21 GB of mp3s, unclassified (bad names, bad ID3 tags, not knowing whether I even liked a given mp3...), replicated across three computers.

I put the main git repo on an external hard disk and cloned it onto each computer. Then I started to classify the files in the usual way (pushing, pulling, merging... deleting and renaming many times).

At the end, I had only ~6 GB of mp3s and ~83 GB in the .git dir. I used git-write-tree and git-commit-tree to create a new commit with no commit ancestors, and started a new branch pointing to that commit. "git log" for that branch showed only one commit.

Then I deleted the old branch, kept only the new one, deleted the ref-logs, and ran "git prune": after that, my .git folders weighed only ~6 GB...

You could "purge" the huge repository from time to time in the same way: Your "git clone"'s will be faster.

Banengusk
I did something similar once, when I had to split a repository I had accidentally merged back into two distinct ones. Interesting usage pattern, though. :)
pi
Would this be the same as just: rm -rf .git; git init; git add .; git commit -m "Trash the history."
Pat Notz
Yes, in my mp3 case it amounts to the same thing. But sometimes you don't want to touch your branches and tags (so there is no space reduction in public repositories), yet you still want to speed up a "git clone/fetch/pull" of a single branch (less space for repositories dedicated to that branch).
Banengusk
+7  A: 

Be warned: git currently has a bug where files over 2 GB cannot be added to a repository on 32-bit machines.

Herge
A: 

SVN seems to handle binary deltas more efficiently than git does.

I had to decide on a versioning system for documentation (JPGs, PDFs, ODTs). I just tested adding a JPEG and rotating it 90 degrees four times (to check the effectiveness of binary deltas): git's repository grew by 400%, while SVN's repository grew by only 11%.

So it looks like SVN is much more efficient with binary files.

So my choice is git for source code and SVN for binary files like documentation.
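
A test like that can be reproduced with something along these lines (this sketch assumes ImageMagick's convert is installed; the image path is a placeholder):

    git init delta-test && cd delta-test
    cp /path/to/photo.jpg .
    git add photo.jpg && git commit -m "original image"
    for i in 1 2 3 4; do
        convert photo.jpg -rotate 90 photo.jpg          # rotate the image in place
        git commit -am "rotate 90 degrees (step $i)"
    done
    git gc                                              # repack so deltas are actually attempted
    du -sh .git                                         # compare with the size of the matching SVN repository

The "git gc" step matters for a fair comparison; see the comment below.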

Tony Diep
You just needed to run "git gc" (repacking and garbage collection) after adding those 4 files. Git doesn't immediately compress newly added content: it waits so it can compress a group of files together (which is more efficient in terms of size) and so it doesn't slow itself down by separately compressing every single added object. But even without "git gc", git would have done the compression for you eventually, anyway (once it noticed that enough unpacked objects had accumulated).
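
For example, the effect is easy to see with git count-objects:

    git count-objects -v    # "size" = loose objects, each compressed on its own, no deltas
    git gc                  # pack the loose objects and delta-compress them against each other
    git count-objects -v    # "size-pack" = size of the resulting packfile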
nightingale