I have a large set of files (50GB) that exist on two hosts a long distance apart. I want to put them in several Git repositories so that each repository is a mirror of the corresponding one on the other side. But I don't want to transfer the files over the network, because that would take a long time (50-60 hours) and is unnecessary since the files are already on both sides.

My idea was to create a Git repo on each side, add all the files on each side to the local repo, and then git-pull from one to the other. I thought Git would be smart enough to know that the files (objects) are identical and not transfer them. But it doesn't appear to be: on just a small sample, the pull takes a long time (mostly in the "Unpacking objects" stage) and maxes out the network connection between the two hosts. So it seems to be transferring the Git objects unnecessarily.

Does anyone have ideas on how to do this without actually transferring the files?

Thanks!

+1  A: 

That's interesting. This could work, since the contents of the large files are the same (I assume) and should create the same object on both ends.

A test with two repos on my local machine shows that the same file in different repositories gets the same SHA id.

Check whether the SHA ids of your actual files are identical in both repositories. If they are, we need to work out why they are being transferred anyway; if they are not, find out why not.
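To illustrate why the ids should match: a blob's id depends only on the file's bytes, not on its name or on which repository holds it. The sketch below reproduces that id in Python, assuming a default SHA-1 repository; the function name git_blob_id is made up for this example.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 id Git assigns to a blob with these bytes."""
    # Git hashes a short header ("blob <size>\0") followed by the raw content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The same bytes give the same id, no matter which host computes it.
print(git_blob_id(b"hello world\n"))
```

Running this over a file's bytes on each host and comparing the results is a quick way to confirm the contents really are identical.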

Alex Brown
Yes, they are identical. After adding a file to each side, I did git ls-tree on the git tree that contains the file (I assume this is the right way to do this) and the SHA id is d88cbbbe54e7cd688d399f4e2b4f8195fcf2c4a7 for the blob on both sides.
troyh
A: 

I used sneakernet (well, carnet): Take one of your local, downstream git trees and burn the whole thing to DVD. On the remote side, copy the DVD to disk. Then, if necessary, edit the .git/config's [remote "origin"] config section so that the repo can still get to its upstream.

Wayne Conrad
I'd do that, but the other host is on the other side of the country. Besides, burning 7 DVDs, mailing them to someone there, and having them copy the DVDs would take at least 48 hours, so it's not much of a time-saver.
troyh
A: 

What protocol are you using, git or HTTP?

Git is slow when using the HTTP protocol. If your only option is HTTP and you need a DVCS, you could try Mercurial.

If all you need to do is synchronize two remote folders, you could take a look at Beyond Compare.

Lieven
It's slow because my network upload speed is slow (2Mbps), not because of Git, and I'm not using HTTP.
troyh
In that case, I'd look at Beyond Compare. It can check various properties of both files without actually opening them (which would defeat the purpose) to determine whether they have changed. If that doesn't work for you, I think knittl's answer should be spot on. Somehow, expecting git to know that two repos, which happen to have the same file structure, are mirrors of each other seems too simple. I'd assume it could be made to work, but you'd need to manually adjust trees, blobs and the like.
Lieven
+1  A: 

You need the commits to be the same. Even if the tree ids are the same, commit ids can differ, because a commit also records the author, committer and timestamps.
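A rough sketch of why that happens, assuming a default SHA-1 repository and a parentless initial commit: the commit id is a hash over the tree id plus author, committer, timestamps and message, so two commits of the same tree made a minute apart get different ids. The function name git_commit_id, the identity and the timestamps below are all hypothetical.

```python
import hashlib


def git_commit_id(tree: str, author: str, timestamp: int, message: str) -> str:
    """Return the SHA-1 id of a parentless commit with the given fields."""
    ident = f"{author} {timestamp} +0000"
    body = (
        f"tree {tree}\n"
        f"author {ident}\n"
        f"committer {ident}\n"
        f"\n"
        f"{message}\n"
    ).encode("utf-8")
    # As with blobs, Git hashes a "<type> <size>\0" header plus the body.
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# The id of Git's empty tree, used here only as a stand-in tree id.
tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
who = "Example User <user@example.com>"  # hypothetical identity

side_a = git_commit_id(tree, who, 1260000000, "initial import")
side_b = git_commit_id(tree, who, 1260000060, "initial import")  # one minute later
print(side_a == side_b)  # False: same tree, different commit ids
```

Since the two histories then share no commit, a pull treats them as unrelated, even though every blob and tree underneath is identical.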

What I can think of now is the following:

Make the (initial) commit on one side. Note its hash. Find the object file for that hash in the .git/objects/ folder. Copy that file to the same location on the other PC. If the other PC has a tree with the same id, it should work.
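For locating that file, a small sketch: a loose object lives in a directory named after the first two hex digits of its id, in a file named after the remaining 38. The helper name loose_object_path and the commit id below are made up for illustration.

```python
from pathlib import Path


def loose_object_path(git_dir: str, object_id: str) -> Path:
    """Return where Git stores a loose object with this id."""
    # First two hex digits name the directory, the remaining 38 the file.
    return Path(git_dir) / "objects" / object_id[:2] / object_id[2:]


# Hypothetical commit id, for illustration only.
commit_id = "0123456789abcdef0123456789abcdef01234567"
print(loose_object_path(".git", commit_id).as_posix())
```

This only applies while the object is still loose. If it has already been packed (for example after git gc), there is no file at that path to copy.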

knittl
This seems to work! But you also need to edit the .git/refs/heads/master file to contain the commit ID from the other side. But the goal is to end up with 2 repos where one can be a mirror (a backup) of the other. So I want to be able to do work on the first side and have changes pulled. So if you then add another file on the first side and then do a pull, expecting the new file to be transferred, it tells you to do a 'git reset --hard'. If you do that, it does seem to work. Now you have to do 'git reset --hard' every time you add a file on the first side and want to pull from it, though.
troyh
`git reset --hard` sounds strange. Git might want that the first time, but every time? Can you give me the exact error message?
knittl