Is it possible to store only a checksum of a large file in git?

+1 Q:

I'm a bioinformatician currently extracting normal-sized sequences from genomic files. Some genomic files are large enough that I don't want to put them into the main git repository, whereas I'm putting the extracted sequences into git.

Is it possible to tell git, "Here's a large file: don't store the whole file, just take its checksum, and let me know if that file is missing or modified"?

If that's not possible, I guess I'll have to either git-ignore the large files, or, as suggested in this question, store them in a submodule.

+1 A:

How about storing the hashes in a text file, then giving the text file to git? Then you could write a hook that compared hashes, so every time you checked in or checked out, you could be notified of what was missing / different.

Not exactly what you want, and you would still have to maintain the text file manually.
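For what it's worth, here is a minimal sketch of that idea in Python. It is only an illustration under some assumptions: the function names and the "<hash>  <path>" manifest layout are made up for this example, and SHA-256 is an arbitrary choice of checksum. A hook would call check_manifest and print whatever it returns.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(paths, manifest):
    """Record one '<hash>  <path>' line per file (hypothetical layout)."""
    names = sorted(str(p) for p in paths)
    lines = [f"{sha256_of(name)}  {name}" for name in names]
    Path(manifest).write_text("\n".join(lines) + "\n")


def check_manifest(manifest):
    """Return (missing, modified) lists of the paths named in the manifest."""
    missing, modified = [], []
    for line in Path(manifest).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split("  ", 1)
        if not Path(name).is_file():
            missing.append(name)
        elif sha256_of(name) != expected:
            modified.append(name)
    return missing, modified
```

Only the manifest goes into git; the large files themselves stay git-ignored. The manifest still has to be regenerated whenever a large file changes on purpose.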

Seth 2009-10-01 03:45:19

+4 A:

I wrote a script that does this sort of thing. You put file patterns in the .gitattributes file for large media that you don't want going in your git repo and it can store them on S3 instead. It's just a starting point, but I think it's usable if you're interested.

http://github.com/schacon/git-media

Maybe that will help you, or at least show you how it could be done and you can customize it for your specific needs.
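To give a rough idea of the general approach (this is not the actual git-media code, just a sketch under assumptions): git lets a pattern in .gitattributes name a filter, and the filter's "clean" command decides what gets stored in the repo while its "smudge" command decides what lands in the working tree. In the sketch below the function names are hypothetical, a local directory stands in for S3, and the wiring that reads stdin and writes stdout is left out.

```python
import hashlib
from pathlib import Path

HEX_DIGITS = set("0123456789abcdef")


def clean(data, store_dir):
    """Save the real bytes outside the repository and return a short stub."""
    digest = hashlib.sha1(data).hexdigest()
    store = Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)
    (store / digest).write_bytes(data)
    # The stub (the hash plus a newline) is what git would actually store.
    return (digest + "\n").encode("ascii")


def smudge(stub, store_dir):
    """Turn a stub back into the real bytes; keep the stub if they are absent."""
    name = stub.strip().decode("ascii", errors="replace")
    if len(name) != 40 or not set(name) <= HEX_DIGITS:
        return stub  # not a stub written by clean(), pass it through
    target = Path(store_dir) / name
    return target.read_bytes() if target.is_file() else stub
```

This version holds the whole file in memory, so something that streams the data would be a better fit for really large files.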

Scott Chacon 2009-10-01 05:42:21

Amazon S3 wouldn't be an option for me (we're a little nervous about giving data to third parties). Are you planning on options that don't use third parties at some stage?

Andrew Grimm 2009-10-01 23:20:45

@Andrew: I modded the script to support storing files via SCP on your own private server, instead of on S3. Or you can store the files on a mapped network drive. Also I sped it up a bit. See here http://github.com/davr/git-media

davr 2010-07-19 17:42:09

+2 A:

In the upcoming release of git there will be a 'refs/replace/' mechanism, which I think could be adapted for this purpose (assuming that the number of such large-media files, and the number of versions of each, isn't very large).

In the slim fork of your project you would have (like Seth wrote) 'stub' files in place of your large media files. Each stub would contain the SHA-1 of the blob for the corresponding large file (from "git hash-object -t blob <filename>").
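If you would rather compute that blob SHA-1 without shelling out to git, a sketch like the following should give the same value. A blob ID in a SHA-1 repository is the SHA-1 of the header "blob <size>" plus a NUL byte, followed by the file contents. The function name is made up for this example, and it assumes no line-ending conversion or clean filter applies to the file, since git hash-object would otherwise hash the converted contents.

```python
import hashlib
import os


def git_blob_sha1(path, chunk_size=1 << 20):
    """Compute the SHA-1 git would assign to this file's contents as a blob."""
    size = os.path.getsize(path)
    # Git hashes a short header first: b"blob <size in bytes>\0".
    digest = hashlib.sha1(b"blob " + str(size).encode("ascii") + b"\0")
    with open(path, "rb") as handle:
        # Stream the contents so a large genomic file is never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A hook could compare this value with the contents of each stub file to spot a large file that is out of sync.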

Then in the full fork of your project you would use the "refs/replace/" mechanism to replace those 'stub' files with the true contents (using git replace). Some hooks would be required to keep the SHA-1 in the 'stub' files in sync with the actual large-media files.

Then if you want a full clone, you also fetch from the "refs/replace/" namespace; if you want a slim clone, you don't fetch "refs/replace/".

Note: I haven't actually tested such a setup; also, this isn't available in git yet, unless you run 'master'.

Jakub Narębski 2009-10-01 07:53:32

Very cool! I didn't know about this. Where does one get such information? The git mailing list, Junio's blog? Is there some kind of announcement service, "this week in git.git" or something like Jon Masters' daily LKML summary podcast? I find that it is sometimes hard to follow new features in Git, e.g. what's up with git-notes?

Jörg W Mittag 2009-10-01 17:12:52

I watch the git mailing list, so that's how I know. You can watch the RelNotes instead; the information about `refs/replaces/` is in http://git.kernel.org/?p=git/git.git;a=blob;f=Documentation/RelNotes-1.6.5.txt (so it is in git version 1.6.5; my mistake)

Jakub Narębski 2009-10-01 18:51:26

Errr... git version 1.6.5 is the **next** version to be released (as of 01-10-2009)

Jakub Narębski 2009-10-01 18:52:55

Also, Junio C Hamano posts "What's in git.git ..." and "What's cooking in git.git ..." messages quite regularly; you can read them in RSS format thanks to http://gitrss.q42.co.uk (select the "status" feed)

Jakub Narębski 2009-10-01 18:56:30

Is it spelt 'refs/replace', rather than 'refs/replaces'? Also, is documentation for this command available yet?

Andrew Grimm 2009-10-01 23:35:18

http://github.com/git/git/blob/master/Documentation/git-replace.txt

Andrew Grimm 2009-10-01 23:38:59
