+2  A: 

This has been something I've been struggling with as well. I'll be keenly interested if anyone else offers a better answer than mine, but for the time being...

First off, there is http://www.xn--schler-dya.net/blog/2008/01/15/diffing_json_objects/

I personally have not been able to get this library to work, but your mileage may vary.

The other alternative is to not solve the problem with a diff algorithm at all. Diffing is quite inefficient, and depending on the data, you may get better performance by just sending the whole blob, even if you do end up repeating yourself. This is true mainly of very small chunks of data. Obviously there's going to be a turning point as the data you need to transmit gets larger, but you won't know where that point is without some kind of measurement.

The trick is that the bigger your data gets, the longer your diff calculation takes too. The turning point is determined by the intersection of the two lines formed by each method's rate of growth, both of which are going to be linear or worse, depending on how your diff is implemented. In a worst-case scenario, you may even see an island in the middle where diff gets better performance, but then crosses back again for even larger data sets, where just sending everything over the network wins again.

Next stop before trying diff is wrapping your data access in "get", "set" and "delete" methods that track the changes being made. The data you send over the wire is essentially a sequential log of these methods' usage, which you flush from the client side on each successful transmission. On the server side you apply this log to your data with server-side analogues of your data access methods. This is a somewhat lighter solution than a diff, and it doesn't require as much processing power.
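
Here's a minimal sketch of that idea in plain JavaScript (the "TrackedStore" name and the log format are my own invention, not from any particular library):

    // Hypothetical change-tracking wrapper: every mutation is recorded
    // in a log that can be flushed after a successful transmission.
    function TrackedStore(data) {
        this.data = data || {};
        this.log = [];
    }

    TrackedStore.prototype.get = function (key) {
        return this.data[key];
    };

    TrackedStore.prototype.set = function (key, value) {
        this.data[key] = value;
        this.log.push({ op: "set", key: key, value: value });
    };

    TrackedStore.prototype.remove = function (key) {
        delete this.data[key];
        this.log.push({ op: "delete", key: key });
    };

    // Call this once the server has acknowledged the previous batch.
    TrackedStore.prototype.flush = function () {
        var pending = this.log;
        this.log = [];
        return pending;
    };

The server replays the same "set"/"delete" operations against its own copy; if a request fails, the client simply keeps the log and retries.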

Finally, if you're going to do a diff, the most efficient way I can think of is to break your data set down into discrete "chunks", each with a unique ID. Then when you run the diff, the coarseness of the diff is exactly at the "chunk" level: the only comparisons you make are ID to ID. If you change a chunk, give it a new ID. The coarser you can afford to make the diff algorithm, the less time it will take to run.

Alternatively, rather than assigning a new ID on change, you could run the diff just to check whether a specific chunk has changed, stop as soon as you detect a change, and mark that chunk to be re-sent in its entirety, updating the chunk with the same ID on the server side. This could be made even more efficient if you have some kind of quick hashing algorithm for your chunks that lets you establish equality cheaply.
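
For the hashing idea, one crude but workable sketch is to use each chunk's serialized form as its fingerprint (this assumes chunks serialize deterministically, i.e. stable key order; a real string hash would be faster for large chunks):

    // Remember the fingerprint of each chunk as it was last sent.
    function fingerprint(chunk) {
        return JSON.stringify(chunk);
    }

    // Collect the IDs of chunks that need to be re-sent in full.
    function dirtyChunks(chunks, lastSent) {
        var dirty = [];
        for (var id in chunks) {
            if (fingerprint(chunks[id]) !== lastSent[id]) {
                dirty.push(id);
            }
        }
        return dirty;
    }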

If the sequence of your chunks doesn't matter, or if you can store the sequence as a property of the chunks themselves rather than modelling it with their physical order, then you can key your chunks by ID. Discovering the differences is then simply a matter of listing the keys of object A and looking them up on object B, and vice versa. This is much simpler to implement than a "real" diff algorithm, and it has O(a+b) performance, which (I think) is better than the worst case of a real diff algorithm, and the worst case is what you're likely to get if you implement one yourself or pick a bad implementation.
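
A sketch of that key-based diff, assuming the client remembers a fingerprint per chunk ID as in the previous snippet:

    // O(a + b): one pass over each key set, no real diff algorithm needed.
    function diffChunks(clientChunks, serverFingerprints) {
        var changes = { added: [], updated: [], removed: [] };
        var id;

        for (id in clientChunks) {
            if (!(id in serverFingerprints)) {
                changes.added.push(id);
            } else if (JSON.stringify(clientChunks[id]) !== serverFingerprints[id]) {
                changes.updated.push(id);
            }
        }
        for (id in serverFingerprints) {
            if (!(id in clientChunks)) {
                changes.removed.push(id);
            }
        }
        return changes;
    }

Only the chunks listed in "added" and "updated" ever cross the wire, plus the list of removed IDs.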

Breton
"wrapping your data access in "get", "set" and "delete" methods" yeah, my other thought was storing a bunch of 'modification' objects whenever the JSON is modified, like an audit log, and then sending that over.
SCdF
wrt sending data: nah, it will get huge. Megabytes, potentially. There is a whole other problem around getting that data to the client, which might be like the reverse of this stuff (client builds whole object as it needs it from chunks from server) but that's another kettle of fish ;-)
SCdF
I've updated my answer with a couple more ideas.
Breton
I would go the change log route myself; depending on what's changing the data, you might want to merge the entries.
Simon Buchan
Man, you've really thought about this :P
SCdF
A: 

Why create a diff in the first place? Wouldn't it be more efficient to hook into the events that change your data and generate a changed dataset based on those changes? Since your code will be embedded in a browser you'll be limited to JavaScript, so generating a diff could kill the performance of your app.
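
For instance (a rough sketch using plain DOM change events; a framework would offer its own hooks), you can accumulate the changed fields directly instead of ever computing a diff:

    // Accumulate only the fields that actually changed.
    var changed = {};

    document.addEventListener("change", function (e) {
        if (e.target.name) {
            changed[e.target.name] = e.target.value; // last write wins
        }
    });

    // The accumulated object *is* the changed dataset; send it and reset.
    function payloadToSend() {
        var out = changed;
        changed = {};
        return out;
    }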

Damien McGivern
A: 

Somewhere there's a grad thesis in here: finding a solution that's both an efficient transmission of "what's new" and one that plays nicely with web architectures (caching and saving a minimum amount of state on the server).

I'm curious what's out there; I've been grappling with this very same question. The "chunk" idea in @Breton's answer is kinda where I'm going, but I'm not sure how.

edit: wait -- I got this backwards, I thought you were talking about the server calculating diffs to send to the client.

Perhaps you could describe, in vague detail, the structure of the data on the client that is sent to the server. I wouldn't do it based on the JSON text itself if at all possible; do the diffs from within JavaScript. If it's a list of items with known structure, keep track of which ones you've sent to the server and send the rest. If it's a more complicated dataset, I'm not sure how to help you there.
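
A minimal sketch of that "known list" case (assuming each item carries an "id" field; the names here are illustrative):

    // Remember which item IDs have already been sent.
    var sentIds = {};

    function unsentItems(items) {
        var pending = [];
        for (var i = 0; i < items.length; i++) {
            if (!sentIds[items[i].id]) {
                pending.push(items[i]);
            }
        }
        return pending;
    }

    // Call after the server confirms receipt.
    function markSent(items) {
        for (var i = 0; i < items.length; i++) {
            sentIds[items[i].id] = true;
        }
    }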

Jason S
+3  A: 

EtherPad solved something like this by turning each source change into a mathematical operation that could be made commutative, associative, etc. I would consider this a non-trivial problem, though; EtherPad was able to build a business around its solution.

edit: interestingly enough, this is exactly the problem that a good DVCS like Git tackles, just not in the browser.

m104
Are you suggesting I port Git to JavaScript... :P
SCdF
Yes, immediately. Get on it! ;-) Seriously though, I'd like a browser-based solution for this class of problem.
m104
+1  A: 

I published today a small jQuery plugin that does a diff between two JS objects: http://plugins.jquery.com/project/jquery-diff

I use it in an application to send only the modified data to the server backend.

Marc Rutkowski