Best Practice for synchronizing common distributed data


I have an internet application that supports an offline mode, where users might create data that will be synchronized with the server when they come back online. Because of this, I'm using UUIDs for identity in my database, so disconnected clients can generate new objects without fear of using an ID already used by another client. However, while this works great for objects owned by a single user, some objects are shared by multiple users. For example, tags used by a user might be global, and there's no possible way the remote database could hold all possible tags in the universe.

Say an offline user creates an object and adds some tags to it, and those tags don't exist in the user's local database, so the software generates a UUID for each of them. When those tags are synchronized, there needs to be a resolution process to resolve any overlap: some way to match up existing tags in the remote database with the local versions.

One way is to resolve global objects by a natural key (the name, in the case of a tag), with the local database replacing its existing object with the one from the global database. This can be messy when there are many connections to other objects. Something tells me to avoid this.
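A minimal sketch of this natural-key resolution, assuming tags are matched by exact name and every object stores a list of tag UUIDs. The function name `resolve_tags` and the dictionary layouts are hypothetical, chosen only for illustration:

```python
import uuid


def resolve_tags(local_tags, server_tags, objects):
    """Replace local tag UUIDs with the server's UUID when the names match.

    local_tags / server_tags: dict mapping tag UUID -> tag name (assumed layout).
    objects: dict mapping object ID -> list of tag UUIDs (assumed layout).
    Returns the mapping {old local UUID: server UUID} that was applied.
    """
    server_by_name = {name: tag_id for tag_id, name in server_tags.items()}

    # Find local tags whose name already exists on the server under another ID.
    remap = {}
    for local_id, name in local_tags.items():
        server_id = server_by_name.get(name)
        if server_id is not None and server_id != local_id:
            remap[local_id] = server_id

    # Rewrite every reference (dropping duplicates, keeping order) ...
    for tag_ids in objects.values():
        tag_ids[:] = list(dict.fromkeys(remap.get(t, t) for t in tag_ids))
    # ... then swap the duplicate local tag rows for the server's IDs.
    for local_id, server_id in remap.items():
        local_tags[server_id] = local_tags.pop(local_id)

    return remap


server_tags = {str(uuid.uuid4()): "python"}
local_python, local_sync = str(uuid.uuid4()), str(uuid.uuid4())
local_tags = {local_python: "python", local_sync: "sync"}
objects = {"note-1": [local_python, local_sync]}

remap = resolve_tags(local_tags, server_tags, objects)
```

The "messy" part is the middle loop: every table that references a tag needs the same rewrite, so the cost grows with the number of connections to other objects.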

Another way to handle this is to use two IDs. One global ID and one local ID. I was hoping using UUIDs would help avoid this, but I keep going back and forth between using a single UUID and using two split IDs. Using this option makes me wonder if I've let the problem get out of hand.

Another approach is to track all changes through the non-shared objects; in this example, the object to which the user assigned the tags. When the user synchronizes their offline changes, the server might replace the local tag with the global one. The next time this client synchronizes with the server, it detects a change in the non-shared object. When the client pulls down that object, it receives the global tag. The software then simply resaves the non-shared object, pointing it to the server's tag and orphaning the local version. Some issues with this are the extra round trips needed to fully synchronize and the extra data in the local database that is just orphaned. Are there other issues or bugs that could happen when the system is in between synchronization states (e.g. trying to talk to the server and sending it local UUIDs for objects)?

Another alternative is to avoid common objects. In my software that could be an acceptable answer. I'm not doing a lot of sharing of objects across users, but that doesn't mean I won't be doing it in the future, so choosing this option could paralyze my software should I need to add these types of features later. There are consequences to this choice, and I'm not sure I've completely explored them.

So I'm looking for any sort of best practice, existing algorithms for handling this type of system, guidance on choices, etc.

Your problem is quite similar to versioning systems like SVN. You could take those as an example.

Each user would have a set of personal objects plus any shared objects that they need. Locally, they will work as if they own all the objects.

During sync, the client would first download any changes in the objects, and automatically synchronize what is obvious. In your example, if there is a new tag coming from the server with the same name, then it would update the UUID correspondingly on the local system.

This would also be a nice place in which to detect and handle cases like data committed from another client, but by the same user.

Once the client has an updated and merged version of the data, you can do an upload.

There will be two round trips, but I see no way of doing this without overcomplicating the data structure and introducing potential pitfalls in the way you do the sync.
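The download, merge, upload order described in this answer could look roughly like the sketch below. `FakeServer`, `pull`, `push`, and the client's data layout are hypothetical stand-ins for a real transport and database, and tags are matched by name only:

```python
class FakeServer:
    """In-memory stand-in for the remote database (hypothetical)."""

    def __init__(self):
        self.tags = {}      # tag name -> tag UUID
        self.objects = {}   # object ID -> list of tag UUIDs

    def pull(self):
        # Round trip 1: the client downloads the server's view of shared tags.
        return dict(self.tags)

    def push(self, tags, objects):
        # Round trip 2: the client uploads its already-merged data.
        for name, tag_id in tags.items():
            self.tags.setdefault(name, tag_id)
        self.objects.update({k: list(v) for k, v in objects.items()})


def sync(client_tags, client_objects, server):
    """client_tags: name -> UUID; client_objects: object ID -> tag UUIDs."""
    server_tags = server.pull()

    # Merge locally: adopt the server's UUID for any tag name it already has.
    remap = {}
    for name, server_id in server_tags.items():
        local_id = client_tags.get(name)
        if local_id is not None and local_id != server_id:
            remap[local_id] = server_id
        client_tags[name] = server_id
    for tag_ids in client_objects.values():
        tag_ids[:] = [remap.get(t, t) for t in tag_ids]

    # Upload only after the local copy agrees with the server.
    server.push(client_tags, client_objects)


server = FakeServer()
server.tags["urgent"] = "srv-1"          # tag that already exists globally
client_tags = {"urgent": "loc-1", "home": "loc-2"}
client_objects = {"obj-1": ["loc-1", "loc-2"]}

sync(client_tags, client_objects, server)
```

A real server would also have to cope with another client pushing between the two round trips; this sketch ignores that case.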

Sklivvz 2009-08-12 12:47:47

As a totally out of left-field suggestion, I'm wondering if using something like CouchDB might work for your situation. Its replication features could handle a lot of your online/offline synchronisation problems for you, including mechanisms to allow the application to handle conflict resolution when it arises.

Evan 2009-08-12 14:18:48

Only if I could study how they choose to do replication and rewrite it for my environment. I can't use CouchDB on both ends; my choices on the client are limited to a single database. So even if I used CouchDB on the server side, I would still have to sync to a non-CouchDB database.

chubbard 2009-08-12 22:36:49

Yes, you're right. Unless you can run CouchDB at both ends, there's no advantage. Pity.

Evan 2009-08-12 23:42:45

I would be interested in hearing about CouchDB's algorithm. How do they detect changes from one set to another? I kinda wanted this to be more of a discussion of synchronization algorithms as opposed to products that do sync.

chubbard 2009-08-14 20:34:45

Depending on what application semantics you want to offer to users, you may pick different solutions. E.g., if you are actually talking about tagging objects created by an offline user with a keyword, and want to share the tags across multiple objects created by different users, then using the "text" of the tag as its identity is fine, as you suggested. Once everyone's changes are merged, tags with the same "text", say "THIS IS AWESOME", will be shared.

There are other ways to handle disconnected updates to shared objects. SVN, CVS, and other version control systems try to resolve conflicts automatically and, when they cannot, just tell the user there is a conflict. You can do the same: tell the user there have been concurrent updates and let the users handle the resolution.

Alternatively, you can also log updates as units of change, and try to compose the changes together. For example, if your shared object is a canvas, and your application semantics allows shared drawing on the same canvas, then a disconnected update that draws a line from point A to point B, and another disconnected update drawing a line from point C to point D, can be composed. In this case, if you keep those two updates as just two operations, you can order the two updates and on re-connection, each user uploads all its disconnected operations and applies missing operations from other users. You probably want some kind of ordering rule, perhaps based on version number.
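As a rough illustration of composing logged operations, the sketch below merges two clients' operation logs and replays them in a deterministic order. The `Op` record, the `(version, client_id)` ordering rule, and the canvas modeled as a list of line segments are all assumptions made for the example:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    version: int      # canvas version the op was made against (assumed)
    client_id: str    # breaks ties between ops with the same version
    line: tuple       # ((x1, y1), (x2, y2)) segment to draw


def merge_logs(*logs):
    """Union the logs, drop duplicates, and sort by (version, client_id)."""
    unique = set()
    for log in logs:
        unique.update(log)
    return sorted(unique, key=lambda op: (op.version, op.client_id, op.line))


def replay(ops):
    """Apply the ops in order; drawing a line just appends it to the canvas."""
    canvas = []
    for op in ops:
        canvas.append(op.line)
    return canvas


alice = [Op(1, "alice", ((0, 0), (1, 1)))]   # line from A to B
bob = [Op(1, "bob", ((2, 2), (3, 3)))]       # line from C to D

# Both clients exchange logs and end up with the same canvas.
canvas_alice = replay(merge_logs(alice, bob))
canvas_bob = replay(merge_logs(bob, alice))
```

This only works because drawing operations do not interfere with one another; operations that overwrite the same field would still need a conflict rule.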

Another alternative: if updates to shared objects cannot be automatically reconciled, and your application semantics do not support notifying the user and asking them to resolve conflicts caused by disconnected updates, then you can use a version tree to handle this. Each update to a shared object creates a new version, with the past version as its parent. When two different users make disconnected updates to a shared object, two separate child versions (leaf nodes) result from the same parent version. If your application's internal representation of state is this version tree, then that internal state remains consistent despite disconnected updates, and you can handle the two branches of the version tree in some other way (e.g. letting users know about the branches and creating tools for them to merge branches, as in source control systems).
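A small sketch of such a version tree, assuming each version has exactly one parent and is identified by a caller-supplied ID. The `VersionTree` class and its method names are hypothetical:

```python
class VersionTree:
    """Each version records its parent; concurrent updates become siblings."""

    def __init__(self, root_id, root_value):
        self.parent = {root_id: None}
        self.value = {root_id: root_value}

    def commit(self, version_id, parent_id, value):
        if parent_id not in self.parent:
            raise KeyError(f"unknown parent version: {parent_id}")
        if version_id in self.parent:
            raise ValueError(f"version already exists: {version_id}")
        self.parent[version_id] = parent_id
        self.value[version_id] = value

    def leaves(self):
        """Versions with no children; more than one means unmerged branches."""
        parents = set(self.parent.values())
        return sorted(v for v in self.parent if v not in parents)


tree = VersionTree("v1", "tag: awesome")
# Two users update the same shared object while disconnected from each other.
tree.commit("v2-alice", "v1", "tag: Awesome")
tree.commit("v2-bob", "v1", "tag: AWESOME")

branches = tree.leaves()
has_conflict = len(branches) > 1
```

Neither update is lost or rejected: both sit in the tree as leaves until someone merges them. Recording a merge would need a version with two parents, which this sketch does not model.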

Just a few options. Hope this helps.

OverClocked 2009-08-13 13:01:35

I often see people cite SVN, CVS, etc. as examples of synchronization, but with no real substantive discussion about HOW they do it. How do they compute the differences on a large scale? How do they keep track of what's changed since the last sync? To some degree I think CVS and SVN are not applicable. SVN doesn't directly handle situations where I create a file and another user creates a file with the same name. The user has to decide by merging. In my case I need to be more graceful, because the user's old tag doesn't have the same ID as the one in the remote store. SVN isn't an object graph. No shifting IDs.

chubbard 2009-08-14 17:01:42

Your app's requirements are important. You may want to look at the Coda file system and other prior art. I think you have 3 options. Option 1: ask the user to help merge conflicts and clean up application state. Option 2: always allow concurrent operations, e.g. use globally unique IDs for new files, but allow users to create files with the same name in the same shared directory. Option 3: automatically split the object into two branches/versions on concurrent updates. The user can merge the branches whenever, but before the merge, the app's internal state is "consistent".

OverClocked 2009-08-17 14:03:53

I'd be happy to brainstorm more if you describe more of what you are trying to do...

OverClocked 2009-08-17 14:04:29

Maybe we could brainstorm over email, since these comments are limited to 600 chars. I've been continuing to work on this, and I think my solution is to simply synchronize the local objects along with whatever they connect to. The shared/global objects will just get resolved, because the server will send them to the client on synchronization. At that point the client drops its local version and uses the server's one. But I think further discussion might yield more insight. I'm game if you are. I'm not sure how I can send you a private message or contact you directly.

chubbard 2009-08-26 20:58:42
