views: 127
answers: 2

I'm in the process of writing an application which needs to synchronize a file-structure between a client and an (HTTP) server.

The file-structure is essentially a list of file-paths, where each path is a string associated with 1 or more data-block ids (256-bit references to the actual data-blocks). A data-block can be referenced by several files, so there's an n-m relation between paths and ids. Right now it is just a list of paths with their ids, but it can easily be converted to the tree structure which the paths represent, if that's necessary for the synchronization.
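To make the structure concrete, here is a minimal sketch (the paths and hex ids are made up for illustration): a flat map from path to an ordered list of block ids, plus a conversion to the tree the paths represent.

```python
# Flat structure: each path maps to an ordered list of 256-bit block ids
# (hex strings here). The same id can appear under several paths, giving
# the n-m relation described above.
paths = {
    "docs/readme.txt": ["a1" * 32],
    "docs/copy.txt":   ["a1" * 32],          # shares a block with readme.txt
    "media/img.png":   ["b2" * 32, "c3" * 32],
}

def to_tree(paths):
    """Convert the flat path list into the nested tree the paths represent.

    Directories become dicts; leaves are the block-id lists.
    """
    tree = {}
    for path, blocks in paths.items():
        node = tree
        parts = path.split("/")
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = blocks
    return tree
```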

I'm looking for a data structure which allows me to sync this data efficiently. Mainly achieving two goals:

  1. A change in one file should not force the client to send the entire file-structure to the server, only a small subset of it.
  2. If many files are changed, these changes should be grouped together, e.g. so that 1000 changes don't result in 1000 requests to the server.

As you can see, the goals are a bit conflicting, so I'm looking for something which finds a good middle ground between them. The second goal can easily be achieved by grouping several changes into one HTTP request, but then the processing required by the server (to parse all the changes carried by the request) should be very inexpensive, computation-wise.
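One way the batching could look (a sketch under my own assumptions, not a prescribed format): encode each change as one small, flat JSON object per line, so the server can parse and apply entries independently with near-constant work per entry.

```python
import json

def encode_batch(changes):
    """Client side: pack many (path, block_ids) changes into one request body.

    One JSON object per line keeps each entry independent and cheap to parse.
    """
    return "\n".join(
        json.dumps({"path": path, "blocks": ids}) for path, ids in changes
    )

def decode_batch(body):
    """Server side: parse each line on its own; no cross-entry state needed."""
    return [json.loads(line) for line in body.splitlines() if line]
```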

I should also mention that there could be several clients synchronizing the same structure on the server. It must therefore be easy to detect the changes made by one client and then synchronize them to another client (i.e. it's not just an upload to the server).

I'm certainly not the first one doing something like this, so I assume there are some smart solutions available. For instance, I guess both Dropbox and Subversion have similar requirements when they sync their meta-data. Does anyone happen to know how they have implemented it?

+2  A: 

Any reason not to use rsync? If you need to programmatically control it, there is librsync.

The subversion source code is open, so you could check that. Also, I know that Mercurial has a pretty smart wire protocol for minimizing traffic.

John Paulett
Thanks, but there are a couple of problems with rsync. My client has to communicate over HTTP (otherwise I'll run into firewall issues). It also still doesn't solve the problem of what data structure to use: a small change in the data might result in a large amount of changes sent (even if I use rsync), if I'm not careful about the structure of the data.
Yrlec
Noticed your edit now... I will check out Mercurial. Hopefully it can give me some insights.
Yrlec
A: 

I've decided to solve this using a transaction log. Each client saves all changes to the tree to a transaction log (in addition to the local db of the tree, which it also keeps), which it periodically syncs with the server. The log is just a list of entries with file → data-block ids and a timestamp.

When the log has been sent to the server it is removed from the client. Before uploading the log it also asks for logs written by other clients to the same tree. These logs are then merged into the local tree.

The log itself will be stored on the server using Azure Blob Storage. The server can periodically remove old entries from the log (if it grows too big).

This way the clients can efficiently communicate their changes to each other, while the server doesn't have to do any expensive processing on each request.

Yrlec