I have a web application where a client-side editor is editing a very large text that is already known on the server side.

The client can make any kind of modifications to this text.

What is the most network-efficient way to transmit the resulting difference in a form that the server understands? Also, since this will happen on the client side (JavaScript), I would like it to be fast (or at least not noticeably slow).

Some scenarios:

  • User modifies ONE character
  • User modifies several sentences in random positions
  • User erases everything, resulting in a blank text.

I cannot use diff-like syntax since it's not network-efficient: it compares lines, so scenarios 1 and 3 will produce horrible differences (especially the last one, where the diff will be larger than the old text itself).

Does anyone have experience in this matter? The user operates on a really large set of data, around 3-5MB of text, and uploading the whole "new" content is a big no-no.

To be clear, I'm looking for a transfer "protocol"; string comparison is not the issue.

A: 

A simple approach, assuming that you know the copy on the server isn't going to change, would just be to send a list of edits (deletions and additions), with the deletions represented as a start and end index, and the additions represented as a start index and the text to insert.

If you have more than a simple diff algorithm to work with (I'm not sure exactly what you mean by "string comparison is not the issue"), you could also detect moved or copied chunks of text, and send those as the start and end index of the moved or copied piece of text, as well as the destination to insert it.

Note that you'll need to make sure to keep track of whether your indices refer to the original document, or the document as edited so far. An easy approach to avoid this problem is to always perform the edits from the end of the document towards the beginning; then earlier edits won't affect the offsets specified by later edits.
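The back-to-front idea can be sketched as follows. This is only an illustration, not a definitive implementation: the edit shape `{ start, end, text }` and the function name `applyEdits` are hypothetical, offsets are assumed to be JavaScript string indices (UTF-16 code units) into the original document, and the edits are assumed not to overlap.

```javascript
// Hypothetical edit shape: { start, end, text }, with offsets into the ORIGINAL text.
// A pure deletion has text === ""; a pure insertion has start === end.
// Assumption: edits do not overlap each other.
function applyEdits(original, edits) {
  // Sort by start offset, highest first, so that each splice only touches
  // text after the positions the remaining edits refer to.
  const ordered = [...edits].sort((a, b) => b.start - a.start);
  let result = original;
  for (const { start, end, text } of ordered) {
    result = result.slice(0, start) + text + result.slice(end);
  }
  return result;
}

const original = "The quick brown fox";
const edits = [
  { start: 4, end: 9, text: "slow" }, // replace "quick" with "slow"
  { start: 19, end: 19, text: "!" },  // insert "!" at the very end
];
const edited = applyEdits(original, edits);
console.log(edited); // The slow brown fox!
```

Because every offset refers to the original document, the client does not have to adjust later indices to account for earlier edits. For a multi-megabyte text with many edits, rebuilding the string on each step may be slow; collecting the pieces in an array and joining once would be a possible refinement.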

For an example of an approach like this, see the ed format that diff -e outputs. This is basically input that could be fed into the ed line-oriented text editor. If you want the absolute smallest diffs to send across you may want to do character based indexing rather than line based indexing, but the same basic approach could work.

Brian Campbell
A: 

You could just send changes every 500ms, so, whatever changes were made in the last 500ms would be sent, but you only send data when there was a change.

In each update you could send the position of the changed word(s) along with the entire replacement word, with the position counted from the front of the text.

Each update won't be several sentences' worth, though several words may be involved; if you send them in the order they were changed, the result should be consistent.

James Black
What if the user cuts the first half of the document and pastes it below the second half?
Kev
The cut would be one operation and the paste the other, as doing both in 500ms is unlikely.
James Black
A: 

Any edits the user is performing can be efficiently broken down into: delete from X for length Y; insert at X text "whatever". X is an offset in characters from the start of the text; Y is a number of characters; "whatever" is any string of characters. You say you need no help computing the diff, but an example is here; its output is richer than you need, but it does identify "removals and insertions", so just change the output part.

The exact format in which you send the data to the server can be tuned, but I don't think there's much mileage in doing that. Pending measurement, I'd start by sending the commands as D for delete or I for insert, the numbers in decimal, and the inserted string in quoted form. Once you have some statistics on actual transfers being performed, you can see how much overhead is in the numbers (decimal vs binary) and quotes, but I suspect that may not be all that meaningful (if it proves to be, there are all sorts of things you can try, such as giving offsets from the latest point of insertion or deletion, rather than always from the start, to make things smaller).
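One possible reading of the D/I format is sketched below. The exact syntax is an assumption made for illustration (one command per line, a comma between fields, JSON quoting for the inserted string), and the names `encodeCommands`, `decodeCommands`, and `applyCommands` are hypothetical. Commands are applied in order, so each offset refers to the text as edited so far, and offsets are JavaScript string indices.

```javascript
// Hypothetical wire format, one command per line:
//   D<offset>,<length>        delete <length> characters starting at <offset>
//   I<offset>,<json-string>   insert the quoted string at <offset>
// JSON.stringify escapes newlines inside the string, so a command never spans lines.
function encodeCommands(commands) {
  return commands
    .map((c) =>
      c.op === "D"
        ? `D${c.offset},${c.length}`
        : `I${c.offset},${JSON.stringify(c.text)}`
    )
    .join("\n");
}

function decodeCommands(wire) {
  if (wire === "") return [];
  return wire.split("\n").map((line) => {
    const comma = line.indexOf(",");
    const offset = Number.parseInt(line.slice(1, comma), 10);
    const rest = line.slice(comma + 1);
    return line[0] === "D"
      ? { op: "D", offset, length: Number.parseInt(rest, 10) }
      : { op: "I", offset, text: JSON.parse(rest) };
  });
}

// Apply commands in order; each offset refers to the text as edited so far.
function applyCommands(text, commands) {
  let result = text;
  for (const c of commands) {
    result =
      c.op === "D"
        ? result.slice(0, c.offset) + result.slice(c.offset + c.length)
        : result.slice(0, c.offset) + c.text + result.slice(c.offset);
  }
  return result;
}

const commands = [
  { op: "D", offset: 0, length: 5 },       // remove "hello"
  { op: "I", offset: 0, text: "goodbye" }, // put "goodbye" in its place
];
const wire = encodeCommands(commands);
const updated = applyCommands("hello world", decodeCommands(wire));
console.log(updated); // goodbye world
```

With this encoding, erasing an entire document costs a single short line such as `D0,<length>`, and a one-character change costs two short lines, regardless of how large the text is.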

You can sample what the user is doing every few seconds, and just send the incremental changes over those last few seconds (if any) -- this way, each packet you're sending will be small, and if the net connection or the user's computer/browser crash, the user won't have lost much work.

Alex Martelli
A: 

I'm not very familiar with this topic but I can point you to an open source (Apache License 2.0) project which may be very useful.

It is a Diff, Match and Patch library written in several languages, including JavaScript, by a Google engineer, and it is used in several online collaborative editing services.

Here is a list of resources:

brianpeiris
Oh that google talk is amazing!
LiraNuna
Yeah, Google Tech Talks has lots of hidden gems. I subscribe to their RSS feed[1], which is how I stumbled upon this fascinating and in-depth talk. [1]: http://gdata.youtube.com/feeds/base/users/googletechtalks/uploads
brianpeiris
A: 

There are so many ways to do edits, even within short periods of time like 500ms (dragging and dropping, or cutting and pasting large sections of text around within the document or from outside it), that I don't know if there's going to be something that will cover all scenarios really well. This is certainly a non-answer to your question at face value, but I would carefully weigh the trouble of developing and maintaining something like this against changing the interface to restrict the text size and breaking existing texts into smaller pieces.

Maybe that's not possible in your situation, but if it is, I would guess it would be much less trouble in the end to dodge the issue in this way and just send full documents after an edit.

Kev
It's not real time. I need to compare two big strings: the original, which is known to the server, and a changed copy on the client side. The user will press a button when done.
LiraNuna
I know, I was commenting on other people's proposed solutions of monitoring what the user was doing at short intervals to make it easier to calculate the diff. My point is that a user can make the diff complicated to calculate well pretty much instantaneously.
Kev