I've got an app that has about 10 types of objects. There will be potentially a few thousand object instances of each type. These lists of objects need to stay synchronized between apps running on different machines. If an object is added, changed or deleted, that needs to propagate to the other machines.

This will be a star topology -- there is a central master, and the rest are clients.

I DO have the concept of a session, so can store data about each client.

Is there a good design pattern to follow for this? Even better, is there a (template based?) library that would handle asking the container what has changed since client X came by and getting that delta to send out?

Right now I'm thinking every object-type container has an update counter. When something is added/changed/removed, the update counter is incremented, and the changed object(s) are tagged with that value. Each client will save the value of the update counter when it gets an update. Later it will come back and ask for any changes since its update counter value. Finally, deletes are kept as tombstone records (although I'm not exactly sure when to clear them out).
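A minimal Python sketch of the counter-plus-tombstone idea described above, just to make it concrete. The names (`SyncContainer`, `changes_since`, `purge_tombstones`) are made up for illustration, and purging tombstones once every known client has passed them is only one possible policy, not something settled in this question.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class Entry:
    value: Optional[Any]   # None once the object has been deleted
    version: int           # container counter value at the last change
    deleted: bool = False  # True marks a tombstone


@dataclass
class SyncContainer:
    """Hypothetical container for one object type with an update counter."""
    counter: int = 0
    entries: Dict[str, Entry] = field(default_factory=dict)

    def put(self, key: str, value: Any) -> None:
        # Adds and changes both bump the counter and tag the object with it.
        self.counter += 1
        self.entries[key] = Entry(value, self.counter)

    def delete(self, key: str) -> None:
        # Keep a tombstone so clients can learn about the delete later.
        entry = self.entries.get(key)
        if entry is not None and not entry.deleted:
            self.counter += 1
            self.entries[key] = Entry(None, self.counter, deleted=True)

    def changes_since(
        self, client_version: int
    ) -> Tuple[int, Dict[str, Any], List[str]]:
        # Returns (new counter value, added/changed objects, deleted keys).
        upserts = {k: e.value for k, e in self.entries.items()
                   if e.version > client_version and not e.deleted}
        deletes = [k for k, e in self.entries.items()
                   if e.version > client_version and e.deleted]
        return self.counter, upserts, deletes

    def purge_tombstones(self, min_client_version: int) -> None:
        # Assumed policy: drop tombstones that every known client has seen.
        self.entries = {k: e for k, e in self.entries.items()
                        if not (e.deleted and e.version <= min_client_version)}
```

A client stores the counter value returned by `changes_since` and passes it back on its next visit. With this purge policy, a client that returns with a counter older than a purged tombstone would miss that delete, so it would need a full resend instead of a delta.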

One thing that makes this harder is that clients can come and go without the central server necessarily knowing, although I guess there could be a timeout concept (if the server hasn't heard from a client in 5 minutes, it assumes the client is gone).

Is this a well-known pattern? Any additional suggestions?

+1  A: 

How you implement synchronization very much depends on your needs. Do the changes need to be sent to the clients, or is it sufficient that each client checks whether an object is up to date whenever it uses that object? How about using the Proxy pattern? This pattern allows you to create a proxy implementation of your objects that can check whether they are up to date, update them if they are not, and then return the result. I would do this by having a lastChanged timestamp on the objects on the master and a lastUpdated timestamp on the client objects. If latency is an issue, checking whether an object is up to date on each call is probably not a good idea. Instead, consider having a separate thread that queries the master for changed objects and marks them "dirty". This could dramatically reduce the network traffic as well.
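Roughly what I mean by the proxy, as a Python sketch. `Master`, `Proxy`, `last_changed` and `fetch` are illustrative names, not an existing library, and I've assumed a counter issued by the master in place of wall-clock timestamps so that clock skew between machines doesn't matter.

```python
from typing import Any, Dict, Tuple


class Master:
    """Stand-in for the remote master; each method call models a round trip."""

    def __init__(self) -> None:
        self._clock = 0  # assumed logical clock used as the "timestamp"
        self._objects: Dict[str, Tuple[Any, int]] = {}
        self.fetches = 0  # counts full object transfers, for illustration

    def put(self, key: str, value: Any) -> None:
        self._clock += 1
        self._objects[key] = (value, self._clock)

    def last_changed(self, key: str) -> int:
        return self._objects[key][1]

    def fetch(self, key: str) -> Tuple[Any, int]:
        self.fetches += 1
        return self._objects[key]


class Proxy:
    """Client-side proxy that refreshes its cached copy only when stale."""

    def __init__(self, master: Master, key: str) -> None:
        self._master = master
        self._key = key
        self._value: Any = None
        self._last_updated = 0  # lastUpdated on the client side

    def get(self) -> Any:
        # Compare the master's lastChanged with our lastUpdated.
        if self._master.last_changed(self._key) > self._last_updated:
            self._value, self._last_updated = self._master.fetch(self._key)
        return self._value
```

Note that every `get()` still costs one `last_changed` check against the master, which is the latency concern; the background-thread variant would move that check off the calling path and only set a dirty flag.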

You could also look into the Observer pattern and Publish/Subscribe.

larsm
Latency will likely be an issue, so I don't think the proxy pattern works. And because there will be so many objects (10K maybe?) and possibly even 100 clients, the Publish/Subscribe seems like it would add a ton of overhead with so many individual updates being fired out. I'm more looking for something where a client can show up and say, 'what's changed since I was last here' and get a bulk list of data returned.
DougN
A: 

An option that might be simple to implement and still pretty efficient is to treat the pile of objects as an opaque blob and use librsync to synchronize them. It sounds like all of the updates flow one direction, from master to clients, and there's probably some persistent representation of the objects on the clients -- a file or something. I'm assuming it's a file for the rest of this answer, though any sequence of bytes can be used.

The way it would work is that each client would generate a librsync "signature" of its local copy of the blob and send that signature to the master. The signature is about 1% of the size of the blob. The master would then use librsync to compute a delta between that signature and the current data, and send the delta to the client, which would use librsync to apply the delta to its local copy of the blob.

The librsync API is simple, and the signature/delta data transfer is relatively efficient.

If that's not workable, it may still be useful to take a more manual "delta-based" approach, to avoid having to do per-object versioning. Each time the master makes a change, it should log that change to a journal, recording what was done and to which object. Versioning is done at the whole-database level, so in effect a version number is assigned to each journal entry.

When a client connects, it should send its version of the whole object collection, and the server can then respond with the contents of the journal between the client's version and the newest entry. If updates on a given object are done by completely replacing the object contents, then you can optimize this by filtering out all but the most recent version of each object. If the master also keeps track of which versions it has sent to which client, it can know when it is safe to discard old journal entries. Even if it doesn't track that, you can still discard old journal entries according to some heuristic (probably just age) and if you receive a connection from a client whose last version is older than your oldest journal entry, then you just have to send the entire set of objects to that client.
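Here is a small Python sketch of that journal scheme, under a few assumptions of mine: updates replace whole objects (so only the latest entry per object is sent), and `JournaledStore`, `sync` and `discard_through` are hypothetical names rather than any real API.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class JournalEntry:
    version: int                 # database-wide version of this change
    op: str                      # "put" or "delete"
    key: str
    value: Optional[Any] = None  # None for deletes


class JournaledStore:
    def __init__(self) -> None:
        self.version = 0                      # whole-database version
        self.objects: Dict[str, Any] = {}
        self.journal: List[JournalEntry] = []
        self.floor = 0  # entries with version <= floor have been discarded

    def put(self, key: str, value: Any) -> None:
        self.version += 1
        self.objects[key] = value
        self.journal.append(JournalEntry(self.version, "put", key, value))

    def delete(self, key: str) -> None:
        self.version += 1
        self.objects.pop(key, None)
        self.journal.append(JournalEntry(self.version, "delete", key))

    def discard_through(self, version: int) -> None:
        # Drop old entries, e.g. by age or once all clients have seen them.
        version = min(version, self.version)
        self.journal = [e for e in self.journal if e.version > version]
        self.floor = max(self.floor, version)

    def sync(self, client_version: int) -> Dict[str, Any]:
        if client_version < self.floor:
            # The journal no longer reaches back far enough: send everything.
            return {"version": self.version, "full": True,
                    "objects": dict(self.objects)}
        latest: Dict[str, JournalEntry] = {}
        for entry in self.journal:
            if entry.version > client_version:
                latest[entry.key] = entry  # later entries replace earlier ones
        changes = sorted(latest.values(), key=lambda e: e.version)
        return {"version": self.version, "full": False, "changes": changes}
```

The client applies the returned changes (or replaces its whole collection when `full` is true) and remembers `version` for its next connection.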

swillden