I'm a bioinformatician currently extracting normal-sized sequences from genomic files. Some genomic files are large enough that I don't want to put them into the main git repository, whereas I'm putting the extracted sequences into git.

Is it possible to tell git, "Here's a large file - don't store the whole file, just take its checksum, and let me know if that file is missing or modified"?

If that's not possible, I guess I'll have to either git-ignore the large files, or, as suggested in this question, store them in a submodule.

+1  A: 

How about storing the hashes in a text file, then giving the text file to git? Then you could write a hook that compared hashes, so every time you checked in or checked out, you could be notified of what was missing / different.

Not exactly what you want, and you would still have to maintain the text file manually.
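A minimal sketch of the comparison step, in case it helps. Everything here is an assumption rather than something from the answer above: the manifest layout (one "<sha256 digest>  <relative path>" pair per line, as sha256sum prints), the choice of SHA-256, and the helper names sha256_of and check_manifest.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large genomic file is never read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_manifest(manifest_path, root="."):
    """Return (missing, modified) lists of the paths named in the manifest.

    Assumed manifest format: "<hex digest>  <relative path>" on each line.
    """
    missing, modified = [], []
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        expected, name = line.split(maxsplit=1)
        target = Path(root) / name
        if not target.is_file():
            missing.append(name)
        elif sha256_of(target) != expected:
            modified.append(name)
    return missing, modified
```

A hook script (for example pre-commit or post-checkout) could call check_manifest on the tracked text file and print whatever comes back in the two lists. The manifest itself would still have to be regenerated whenever a large file legitimately changes.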

Seth
+4  A: 

I wrote a script that does this sort of thing. You put file patterns in the .gitattributes file for large media that you don't want going in your git repo and it can store them on S3 instead. It's just a starting point, but I think it's usable if you're interested.

http://github.com/schacon/git-media

Maybe that will help you, or at least show you how it could be done and you can customize it for your specific needs.

Scott Chacon
Amazon S3 wouldn't be an option for me (we're a little nervous about giving data to third parties). Are you planning on options that don't use third parties at some stage?
Andrew Grimm
@Andrew: I modded the script to support storing files via SCP on your own private server, instead of on S3. Or you can store the files on a mapped network drive. Also I sped it up a bit. See here http://github.com/davr/git-media
davr
+2  A: 

The upcoming release of git will have a 'refs/replace/' mechanism, which I think could be adapted for this purpose (assuming that the number of such large-media files and the number of their versions isn't very large).

In the slim fork of your project you would have (like Seth wrote) 'stub' files in place of your large media files, whose contents would be the SHA-1 of the blob of the large file (from "git hash-object -t blob <filename>").
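For illustration, the blob SHA-1 can also be computed without invoking git: git hashes the header "blob <size>" plus a NUL byte, followed by the file's bytes. The sketch below assumes the raw bytes on disk are hashed as-is (no clean filters or line-ending conversion); git_blob_sha1 and write_stub are hypothetical helper names, not part of git.

```python
import hashlib
import os


def git_blob_sha1(path, chunk_size=1 << 20):
    """Compute the blob SHA-1 of a file the way git does for unfiltered content.

    The hashed data is b"blob <size>" + NUL + the raw file bytes.
    Assumption: no clean filter or line-ending conversion applies to this path.
    """
    size = os.path.getsize(path)
    digest = hashlib.sha1(b"blob %d\0" % size)
    with open(path, "rb") as handle:
        # Stream the file so large media never has to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_stub(large_file, stub_path):
    """Write the large file's blob SHA-1 into a small stub file (hypothetical layout)."""
    sha = git_blob_sha1(large_file)
    with open(stub_path, "w") as stub:
        stub.write(sha + "\n")
    return sha
```

A hook could rerun write_stub whenever a large file changes, which is one way to keep the stub in sync with the real file.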

Then in the full fork of your project you would use the "refs/replace/" mechanism to replace those 'stub' files with the true contents (using git replace). Some hooks would be required to keep the SHA-1 in the 'stub' files in sync with the actual large-media files.

Then if you want a full clone, you also fetch from the "refs/replace/" namespace; if you want a slim clone, you don't fetch "refs/replace/".

Note: I haven't actually tested such a setup; also, this isn't yet available in git unless you run 'master'.

Jakub Narębski
Very cool! I didn't know about this. Where does one get such information? The git mailinglist, Junio's blog? Is there some kind of an announcement service, "this week in git.git" or something like Jon Masters' daily LKML summary podcast? I find that it is sometimes hard to follow new features in Git, e.g. what's up with git-notes?
Jörg W Mittag
I watch the git mailing list, so that is how I know. You can watch the RelNotes instead; the information about `refs/replaces/` is in http://git.kernel.org/?p=git/git.git;a=blob;f=Documentation/RelNotes-1.6.5.txt (so it is in git version 1.6.5; my mistake)
Jakub Narębski
Errr... git version 1.6.5 is the **next** version to be released (as of 01-10-2009)
Jakub Narębski
Also, Junio C Hamano posts "What's in git.git ..." and "What's cooking in git.git ..." messages quite regularly; you can read them in RSS format thanks to http://gitrss.q42.co.uk (select the "status" feed)
Jakub Narębski
Is it spelt 'refs/replace', rather than 'refs/replaces'? Also, is documentation for this command available yet?
Andrew Grimm
http://github.com/git/git/blob/master/Documentation/git-replace.txt
Andrew Grimm