It primarily depends on your use of the text changes. When the sequence includes both inserts and deletes it is theoretically impossible to know the details of each insert, since some of the symbols inserted may have subsequently been deleted. Therefore you have to choose what results you really want:
- For some purposes you must to know the exact sequence of changes even if some of the inserted symbols must be left as "?".
- For other purposes you must know exactly how the new text differs from the old but not the exact sequence in which the changes were made.
I will techniques to achieve each of these results. I have used both techniques in the past, so I know they are effective.
To get the exact sequence
This is more appropriate if you are implementing a history or undo log or searching for specific actions.
For these uses, the process you describe is probably best, with one possible change: Instead of "finding the mappings between the unknown symbols and the real ones", simply run the scan forward to find the text of each "Delete" then run it backward to find the text of each "Insert".
In other words:
Start with the initial text and process the changes in order. For each insert, insert '?' symbols. For each delete, remove the specified number of symbols and record them as the text deleted.
Start with the final text and process the changes in reverse order. For each delete, insert '?' symbols. For each insert, remove the specified number of symbols and record them as the text inserted.
When this is complete, all of your "Insert" and "Delete" change entries will have the associated text to the best of our knowledge, and any text that was inserted and immediately deleted will be '?' symbols.
To get the difference
This is more appropriate for revision marking or version comparison.
For these uses, simply use the text change information to compute a set of integer ranges in which changes might be found, then use a standard diff algorithm to find the actual changes. This tends to be very efficient in processing incremental changes but still gives you the best updates.
This is particularly nice when you paste in a replacement paragraph that is almost identical to the original: Using the text change information will indicate the whole paragraph is new, but using diff (ie. this technique) will mark only those symbol runs that are actually different.
The code for computing the change range is simple: Represent the change as four integers (oldstart, oldend, newstart, newend). Run through each change:
- If changestart is before newstart, reduce newstart to changestart and reduce oldstart an equal amount
- If changeend is after newend, increase newend to changeend and increase oldend an equal amount
Once this is done, extract range [oldstart, oldend] from the old document and the range [newstart, newend] from the new document, then use the standard diff algorithm to compare them.