views: 1033
answers: 5

Is there a distributed version control system (git, bazaar, mercurial, darcs etc.) that can handle files larger than available RAM?

I need to be able to commit large binary files (i.e. datasets, source video/images, archives), but I don't need to be able to diff them, just be able to commit and then update when the file changes.

I last looked at this about a year ago, and none of the obvious candidates allowed this, since they're all designed to diff in memory for speed. That left me with a VCS for managing code and something else ("asset management" software or just rsync and scripts) for large files, which is pretty ugly when the directory structures of the two overlap.

+2  A: 

I think it would be inefficient to store binary files in any form of version control system.

A better idea would be to store metadata text files in the repository that reference the binary objects.
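A minimal sketch of that pointer-file idea, in Python; the file format and function name here are illustrative assumptions, not borrowed from any existing tool. The asset is hashed in fixed-size chunks so the scheme still works on files larger than available RAM:

```python
import hashlib
from pathlib import Path

CHUNK = 1 << 20  # hash 1 MiB at a time; never load the whole asset into RAM

def write_pointer(asset: Path, pointer_dir: Path) -> Path:
    """Write a small text file describing a large binary asset.

    The pointer (not the asset itself) is what gets committed to the VCS;
    the binary lives in external storage, keyed by its content hash.
    """
    digest = hashlib.sha256()
    with asset.open("rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            digest.update(chunk)
    pointer = pointer_dir / (asset.name + ".ptr")
    pointer.write_text(
        f"name: {asset.name}\n"
        f"size: {asset.stat().st_size}\n"
        f"sha256: {digest.hexdigest()}\n"
    )
    return pointer
```

The VCS then only ever diffs these few-line text files, which it is good at, while the binaries are synced by whatever external mechanism (rsync, a shared store) you already use.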

pobk
Thanks for your response. But yes, I did mean what I asked. I do need to version large files -- there is another class of software "enterprise asset management" that is basically VCS/Aperture/Version Cue on a server for media assets.
joelhardi
I think the point I was trying to make (not enough coffee I'm afraid) was that the majority of VCS systems aren't designed to version binary objects. As you say, they diff in memory and store the delta... There's little point diffing binaries, since the deltas are rarely meaningful.
pobk
A: 

Does it have to be distributed? Supposedly the one big benefit Subversion has over the newer, distributed VCSes is its superior ability to deal with binary files.

Thanks for the answer, but yes, it does. I agree that SVN handles binary files well -- which is part of why it mystifies me that the VCSes I previously tested acted as if segfaulting on a 400 MB file were acceptable behavior.
joelhardi
+5  A: 

No free distributed version control system supports this. If you want this feature, you will have to implement it.

You can write off git: they are interested in raw performance for the Linux kernel development use case. It is improbable they would ever accept the performance trade-off in scaling to huge binary files. I do not know about Mercurial, but they seem to have made similar choices as git in coupling their operating model to their storage model for performance.

In principle, Bazaar should be able to support your use case with a plugin that implements tree/branch/repository formats whose on-disk storage and implementation strategy is optimized for your use case. In case the internal architecture blocks you, and you release useful code, I expect the core developers will help fix the internal architecture. Also, you could set up a feature development contract with Canonical.

Probably the most pragmatic approach, irrespective of the specific DVCS, would be to build a hybrid system: implement a huge-file store, and commit references to blobs in that store to the DVCS of your choice.
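A sketch of what such a huge-file store might look like, assuming a simple content-addressed layout on disk (all names here are illustrative). The returned hash key is the only thing that gets committed to the DVCS:

```python
import hashlib
import shutil
from pathlib import Path

CHUNK = 1 << 20  # stream in 1 MiB chunks so files larger than RAM are fine

def store_blob(src: Path, store: Path) -> str:
    """Copy src into a content-addressed store and return its key.

    Commit only the returned key to the DVCS; the blob itself sits
    outside the repository, sharded by the first two hex digits.
    """
    h = hashlib.sha256()
    with src.open("rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            h.update(chunk)
    key = h.hexdigest()
    dest = store / key[:2] / key
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():  # identical content is stored only once
        shutil.copyfile(src, dest)
    return key
```

Because the store is keyed by content, re-committing an unchanged file is a no-op, and two checkouts can share one store over rsync or a network mount.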

Full disclosure: I am a former employee of Canonical and worked closely with the Bazaar developers.

ddaa
Thanks very much for the reply. I did correspond with some Hg and BZR developers last year and what they said mirrors your assessment -- the BZR folks said "Hmm that's interesting, you could code it" and we considered it but the time cost didn't make sense compared to just using SVN or hacking ...
joelhardi
... up some hybrid solution where we're committing file hashes or something. The DVCS projects all seem to be heavily driven by the distributed FOSS development use case, unlike SVN and commercial products, which have a wider range of uses in mind. Hg and BZR are great projects, so too bad for me.
joelhardi
A: 

I'm having the same problem. How about using ZFS as storage and creating an SVN-like API for use with Tortoise? ZFS is very quick at creating snapshots of files of any size and is very storage-efficient. Please contact me if you would like to start a development team.
http://www.jacovosloo.info

A: 

Yes, Plastic SCM. It's distributed and it manages huge files in 4 MB blocks, so it's not limited by having to load them entirely into memory at any time. Find a tutorial on DVCS here: http://codicesoftware.blogspot.com/2010/03/distributed-development-for-windows.html

pablo