
I have a great deal of data to keep synchronized across 4 or 5 sites around the world, around half a terabyte at each site. It changes (additions or modifications) by around 1.4 gigabytes per day, and the data can change at any of the sites.

A large percentage (30%) of the data is duplicate packages (perhaps packaged-up JDKs), so the solution would have to include a way of noticing that such things are already lying around on the local machine and grabbing them instead of downloading them from another site.
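
To give an idea of what I mean, here's a minimal sketch of the kind of local lookup I'm after, assuming /data as the data root and SHA-1 checksums as the way identical packages get recognised (both just placeholders for the example):

    import hashlib
    import os

    def sha1_of(path, chunk_size=1 << 20):
        """Return the SHA-1 hex digest of a file, read in chunks."""
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def build_local_index(root):
        """Map checksum -> one local path for every file under root."""
        index = {}
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                index.setdefault(sha1_of(path), path)
        return index

    # Before downloading a remote file, look its advertised checksum up here;
    # if an identical copy (e.g. the same JDK package) already exists locally,
    # copy or link it instead of transferring it again.
    local_index = build_local_index("/data")  # /data is a placeholder path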

Control of versioning is not an issue; this is not a codebase per se.

I'm just interested in whether there are any solutions out there (preferably open source) that come close to such a thing.

My baby script using rsync doesn't cut the mustard any more; I'd like to do more complex, intelligent synchronization.

Thanks

Edit: This should be UNIX-based :)

+1  A: 

You have a lot of options:

  • You could try setting up a replicated database to store the data.
  • Use a combination of rsync or lftp and custom scripts, but that doesn't suit you.
  • Use git repos with maximum compression and sync between them using some scripts (a rough sketch follows the list).
  • Since the amount of data is rather large, and probably important, do some custom development or hire an expert ;)
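
For the git option, a very rough sketch of what "maximum compression plus a sync script" could look like, assuming the data lives under /data and the peer sites are reachable over SSH (the path, host, and daily-commit approach are all just assumptions for the example):

    import subprocess

    DATA_DIR = "/data"                 # placeholder path
    PEERS = ["site2.example.com"]      # placeholder peer hosts

    def git(*args):
        """Run a git command inside the data directory."""
        subprocess.run(["git", "-C", DATA_DIR, *args], check=True)

    # One-time setup: turn the data directory into a repo with maximum
    # zlib compression for loose objects and for packs.
    git("init")
    git("config", "core.compression", "9")
    git("config", "pack.compression", "9")

    # Daily sync: commit whatever changed locally, then pull from each peer
    # (each site runs the same script, so changes propagate in both directions).
    git("add", "-A")
    git("commit", "-m", "daily sync", "--allow-empty")
    for peer in PEERS:
        git("pull", "--no-edit", f"ssh://{peer}{DATA_DIR}")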
Dev er dev
+12  A: 

Have you tried Unison?

I've had good results with it. It's basically a smarter rsync, which may be what you want. There is a listing comparing file syncing tools here.
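
For a rough idea of what a Unison run looks like, here's a minimal sketch driven from Python; the roots are placeholders and it assumes unison is installed on both ends:

    import subprocess

    LOCAL_ROOT = "/data"                           # placeholder local root
    REMOTE_ROOT = "ssh://site2.example.com//data"  # placeholder remote root

    # -batch runs non-interactively: non-conflicting changes are propagated,
    # conflicts are skipped and reported rather than prompted for.
    subprocess.run(["unison", LOCAL_ROOT, REMOTE_ROOT, "-batch"], check=True)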

Vinko Vrsalovic
This is *almost* right, and I especially like the link to the website. The trouble is that Unison does not look at the local filesystem for a copy first, say in the parent directory or a sister directory (I'd even like to be able to define where it looks). If the size, name, mod-time and checksum are the same, grab that instead...
Spedge
Why don't you use links for this instead of replicating these JDKs and whatnot? It doesn't seem right to be worrying about duplicating things that certainly don't need duplication. Unison WILL sync links... so that would work, and relieve you of some space needs and some headaches.
Vinko Vrsalovic
+5  A: 

Sounds like a job for BitTorrent.

For each new file at each site, create a BitTorrent seed file and put it into a centralized, web-accessible directory.

Each site then downloads (via BitTorrent) all files. This gets you bandwidth sharing and automatic local copy reuse.

The actual recipe will depend on your needs. For example, you can create one BitTorrent seed for each file on each host, and set the modification time of the seed file to be the same as the modification time of the file itself. Since you'll be doing it daily (hourly?), it's better to use something like "make" to (re-)create seed files only for new or updated files.
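
A sketch of that "make"-like step: regenerate a .torrent only when the data file is newer than its existing seed, and stamp the seed with the file's mtime. transmission-create is just one possible tool for building the seeds, and the paths and tracker URL are placeholders:

    import os
    import subprocess

    DATA_DIR = "/data"            # placeholder: files to share
    SEED_DIR = "/seeds/thishost"  # placeholder: this host's seed directory
    TRACKER = "http://tracker.example.com/announce"  # placeholder tracker URL

    for dirpath, _dirnames, filenames in os.walk(DATA_DIR):
        for name in filenames:
            data_path = os.path.join(dirpath, name)
            # Flattening to the basename is a simplification; real use would
            # need to encode the relative path in the seed name.
            seed_path = os.path.join(SEED_DIR, name + ".torrent")
            data_mtime = os.path.getmtime(data_path)

            # Skip files whose seed is already up to date (the "make" part).
            if os.path.exists(seed_path) and os.path.getmtime(seed_path) >= data_mtime:
                continue

            # Create or refresh the seed file.
            subprocess.run(
                ["transmission-create", "-o", seed_path, "-t", TRACKER, data_path],
                check=True,
            )

            # Give the seed the same mtime as the data file, as described above.
            os.utime(seed_path, (data_mtime, data_mtime))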

Then you copy all seed files from all hosts to the centralized location ("tracker dir") with the option "overwrite only if newer". This gets you a set of torrent seeds for the newest copies of all files.

Then each host downloads all seed files (again, with the "overwrite if newer" setting) and starts a BitTorrent download on all of them. This will download/re-download all the new/updated files.
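
And a sketch of that receiving side: pull the seeds with "only if newer" semantics, then hand them all to a client. aria2c is just one client that accepts .torrent files; the exact client, its options, and the paths are placeholders:

    import glob
    import subprocess

    TRACKER_DIR = "user@tracker.example.com:/seeds/"  # placeholder central seed dir
    LOCAL_SEEDS = "/seeds/incoming"                   # placeholder local copy
    DATA_DIR = "/data"                                # placeholder download target

    # rsync -u ("update") skips seeds that are not newer than what we already
    # have, which gives the "overwrite only if newer" behaviour.
    subprocess.run(["rsync", "-au", TRACKER_DIR, LOCAL_SEEDS + "/"], check=True)

    # Hand every seed to the client; as noted below, files that are already
    # present locally get hash-checked and seeded rather than re-downloaded.
    torrents = glob.glob(LOCAL_SEEDS + "/*.torrent")
    if torrents:
        subprocess.run(["aria2c", "--dir", DATA_DIR, *torrents], check=True)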

Rinse and repeat, daily.

BTW, there will be no "downloading from itself", as you said in the comment. If a file is already present on the local host, its checksum will be verified and no downloading will take place.

ADEpt
I like this idea. Torrenting would certainly clear up bandwidth problems, and downloading things from itself would be genius. An add-on question, however: how do I work out what I need to sync at any one time? I'd need to build a list of the changes... not sure if I can do that :S
Spedge
The way I see it, you can think in terms of the usual copy/move operations, substituting BitTorrent in place of actual file transfers. I'll edit my solution to reflect this.
ADEpt
A: 

Sounds like a job for Foldershare

Echostorm
+2  A: 

How about something along the lines of Red Hat's Global Filesystem, so that the whole structure is split across every site onto multiple devices, rather than having it all replicated at each location?

Or perhaps a commercial network storage system such as from LeftHand Networks (disclaimer - I have no idea on cost, and haven't used them).

warren
A: 

Have you tried the detect-renamed patch for rsync (http://samba.anu.edu.au/ftp/rsync/dev/patches/detect-renamed.diff)? I haven't tried it myself, but I wonder whether it will detect not just renamed but also duplicated files. If it won't detect duplicated files, then, I guess, it might be possible to modify the patch to do so.
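
If rsync is rebuilt with that patch applied, the option it adds is --detect-renamed; a minimal sketch of invoking it (the paths and host are placeholders, and a stock rsync will reject the flag):

    import subprocess

    # --detect-renamed only exists in an rsync built with the patch linked
    # above; an unpatched rsync will refuse to run with this option.
    subprocess.run(
        [
            "rsync", "-a", "--detect-renamed",
            "/data/",                    # placeholder source
            "site2.example.com:/data/",  # placeholder destination
        ],
        check=True,
    )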

Alexander
+1  A: 

Thanks, guys. Although there hasn't been a definitive answer that suits the environment I'm in, there have been some excellent suggestions on where to start and good ideas on how to improve the setup.

I appreciate all the answers. Thanks!

Spedge
+1  A: 

Check out Super Flexible... it's pretty cool. I haven't used it in a large-scale environment, but on a 3-node system it seemed to work perfectly.

bbqchickenrobot