views:

155

answers:

7

Hello,

I've got a list containing positions of text additions and deletions, like this:

     Type   Position   Text/Length
1.   +      2          ab          // 'ab' was added at position 2
2.   +      1          cde         // 'cde' was added at position 1
3.   -      4          1           // a character was deleted at position 4

To make it more clear, this is what these operations will do:

    1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    ---------------------------------
    t | e | x | t |   |   |   |   |  
1.  t | a | b | e | x | t |   |   |  
2.  c | d | e | t | a | b | e | x | t
3.  c | d | e | a | b | e | x | t |

The number of actions can be reduced to:

     Type   Position   Text/Length
1.   -      1          1           // 't' was deleted at position 1
2.   +      1          cdeab       // 'cdeab' was added at position 1

Or:

     Type   Position   Text/Length
1.   +      1          cdeab       // 'cdeab' was added at position 1
2.   -      6          1           // 't' was deleted at position 6

These actions are to be saved in my database and in order to optimize this: how can I reduce the number of actions that are to be done to get the same result? Is there a faster way than O(n*n)?

Note that these actions are chronological, changing the order of the actions will give another result.

+1  A: 

The "diff" tools used in source code control systems use algorithms that produce the minimum edit needed to transform one piece of source code to another - it might be worth investigating them. I think most of them are based (eventually) on this algorithm, but it's a while since I did any reading on this subject.

anon
I have read about these algorithms [diff, Levenshtein], but I'm sure finding the difference between two gigantic pieces of text is slower than merging a couple of actions like I mentioned above
Harmen
And often they don't produce a minimal edit either, just something that is *good enough* (tm).
Lasse V. Karlsen
+3  A: 

Not a solution, just some thoughts:

  • Rule 1: if two consecutive operations don't have overlapping ranges, they can be swapped (with positions adjusted)
  • Rule 2: two consecutive inserts or removals at the same position can be joined
  • Rule 3: when an insert is followed by a removal that is completely contained in the insert, they can be joined

I don't see a straightforward algorithm for the shortest solution. However, an heuristic approach using Rule 1 + 2 might be:

  • move operations "up" unless
    • you'd violate Rule 1
    • you'd move an insert before a removal
    • the position is less than that of that predecessor
  • join consecutive inserts/removals at the same position

Applied to the sample, this would mean:

 + 2 ab
 + 1 cde
 - 4 1

Rule 1 (2x):

+ 2 ab
- 1 1   // position adjusted by -3
+ 1 cde

.

- 1 1  
+ 1 ab  // position adjusted
+ 1 cde

Rule 2:

- 1 1
+ 1 cdeab // watch correct order!

A primitive implementation will be O(N*N) - basically, a bubble sort with additonal stopping conditions. I'm not sure about beating down that complexity, since standard algorithms are of no use here due to having to adjust the position.

However, you might be able to improve things notably - e.g. you don't need a "full sort"

peterchen
Great! I couldn't have thought about this, thank you! After some testing, I'll decide which answer to accept :]
Harmen
+1  A: 

I believe that this can be done considerably faster than O(n²) on average (it is likely that input can be engineered not to allow fast analysis). You can regard consecutive additions or deletions as sets. You can analyze one operation at a time, and you will have to do some conditional transformations:

  • If an addition follows an addition, or a set of additions, it might
    • touch (one or more of) the previous addition(s): then, you can unite these additions
    • not touch: you can order them (you will have to adjust the positions)
  • If a deletion follows an addition, or a set of additions, it might
    • only delete characters from the addition: then, you can modify the addition (unless it would split an addition)
    • only delete characters not from the set of additions: then, you can move the deletion to a position before the set of additions, and perhaps unite additions; after that, the set of deletions before the current set of additions might have to be applied to the additions before that
    • do both: then, you can first split it into two (or more) deletions and apply the respective method
  • If a deletion follows a deletion, or a set of deletions, it can:
    • touch (one or more of) the previous deletion(s): then, you can unite these deletions
    • not touch: you can order them (you will have to adjust the positions
    • in any case, you then have to apply analysis of the newly formed deletions on the previous additions
  • If an addition follows a deletion, no transformation is needed at this point

This is just a first rough draft. Some things may have to be done differently, e.g., it might be easier or more efficient to always apply all deletions, so that the result is always only one set of deletions followed by one set of additions.

Svante
Thank for this description, it's really helpful! I'll put this to the test in some examples; it may take some time to understand it ^^
Harmen
+2  A: 

Make a binary tree representing the document before and after applying all the changes. Each node represents either original text or inserted/deleted text; the latter kind of node includes both the amount of original text to delete (possibly 0) and the string of text to insert (possibly empty).

Initially the tree has just one node, "0 to end: original text". Apply all the changes to it merging changes as you go wherever possible. Then walk the tree from beginning to end emitting the final set of edits. This is guaranteed to produce the optimal result.

  • Applying an insert: Find the appropriate point in the tree. If it's in the middle of or adjacent to inserted text, just change that node's text-to-insert string. Otherwise add a node.

  • Applying a delete: Find the starting and ending points in the tree—unlike an insert, a delete may cover a whole range of existing nodes. Modify the starting and ending nodes accordingly, and kill all the nodes in between. After you're done, check to see if you have adjacent "inserted/deleted text" nodes. If so, join them.

The only tricky bit is making sure you can find points in the tree, without updating the entire tree every time you make a change. This is done by caching, at each node, the total amount of text represented by that subtree. Then when you make a change, you only have to update these cached values on nodes directly above the nodes you changed.

This looks strictly O(n log n) to me for all input, if you bother to implement a balanced tree and use ropes for the inserted text. If you ditch the whole tree idea and use vectors and strings, it's O(n2) but might work fine in practice.

Worked example. Here is how this algorithm would apply to your example, step by step. Instead of doing complicated ascii art, I'll turn the tree on its side, show the nodes in order, and show the tree structure by indentation. I hope it's clear.

Initial state:

*: orig

I said above we would cache the amount of text in each subtree. Here I just put a * for the number of bytes because this node contains the whole document, and we don't know how long that is. You could use any large-enough number, say 0x4000000000000000L.

After inserting "ab" at position 2:

    2: orig, 2 bytes
*: insert "ab", delete nothing
    *: orig, all the rest

After inserting "cde" at position 1:

        1: orig, 1 byte
    5: insert "cde", delete nothing
        1: orig, 1 byte
*: insert "ab", delete nothing
    *: orig, all the rest

The next step is to delete a character at position 4. Pause here to see how we find position 4 in the tree.

Start at the root. Look at the first child node: that subtree contains 5 characters. So position 4 must be in there. Move to that node. Look at its first child node. This time it contains only 1 character. Not in there. This edit contains 3 characters, so it's not in here either; it's immediately after. Move to the second child node. (This algorithm is about 12 lines of code.)

After deleting 1 character at position 4, you get this...

    4: orig, 1 byte
        3: insert "cde", delete nothing
*: insert "ab", delete nothing
    *: orig, all the rest

...and then, noticing two adjacent insert nodes, you merge them. (Note that given two adjacent nodes, one is always somewhere above the other in the tree hierarchy. Merge the data into that higher node; then delete the lower one and update the cached subtree sizes in between.)

    1: orig, 1 byte
*: insert "cdeab", delete nothing
    *: orig, all the rest
Jason Orendorff
Could you please give an example, it sounds good, but I don't get it...
Harmen
OK, updated with a worked example. I put in as much detail as I could. In short you want to find existing code for a balanced tree, add a subtreeSize field to each node, and wrap it in a class that keeps the subtreeSize fields correct and merges adjacent insert nodes. It's nontrivial.
Jason Orendorff
A: 

If n is the number of actions, why this problem has a complexity of O(n*n)? Shouldn't it be O(n)?

I think the performance depends most on the representation of the text and how fast changes could be made on this text representation. If the representation for example is an character array, then there have to be a lot of memory movements. If the representation is a linked list, then character insertions and deletions can be fast accomplished.

Christian Ammer
I'm afraid you don't get the problem... It's about merging actions, not executing them.
Harmen
Sorry, i should read the question more carefully.
Christian Ammer
A: 

Let's assume for simplicity that only letters a-z appear in your texts.

Initialize a list A with values a[i] = i for i = 1 to N (you will figure out yourself how big N should be).

Perform (simulate) all your operations on A. After this analyze A to find required operations:

Fist find required delete operations by finding missing numbers in A (they will form groups of consecutive values, one group stands for one delete operation).

After this you can find required insert operations by finding sequences of consecutive letters (one sequence is one insert operation).

In your example:

init A:

1 2 3 4 5 6 7 8 9 10

Step 1 (+:2:ab):

1 a b 2 3 4 5 6 7 8 9 10

Step2 (+:1:cde):

c d e 1 a b 2 3 4 5 6 7 8 9 10

Step3 (-:4:1):

c d e a b 2 3 4 5 6 7 8 9 10

Now we search for missing numbers to find deletes. In our example only one number (namely number 1) is missing, so only 1 delete is required, so we have one delete operation: -:1:1 (In general there may be more numbers missing, every sequence of missing numbers is one delete operation. For example if 1, 2, 3, 5, 6, 10 are all missing numbers, then there are 3 delete operations: -:1:3, -:2:2, -:5:1. Remember that after every delete operation all indexes are decreased, you have to store total sum of former delete operations to calculate the index of current delete operation.)

Now we search for character sequences to find insert operations. In our example there is only one sequence: cdeab at index 1, so we have one insert operation: +:1:cdeab

Hope this is clear enough.

witzar
Another very different approach, thank you! But it's not fast enough when actions are executed on very different positions, e.g. `+ 100000 a` and `- 2 1` —this can be done in O(1), while your solution needs 100000 characters to loop through three times...
Harmen
what is complexity of your solution? can you give more complicated example of how it works? maybe: with more than one resulting insertion and deletion...
WildWezyr
@Harmen I only stated that to solve the problem it is best to perform operations and then read the result. I named data structure a List just for convinience of the argument, but I never defined it's implementation (LinkedList? TreeList? Custom implementation that pretends a[i]=i for uninitialized indexes?). But you didn't state how big is original text, how many input operation there is and what should be optimized, so I'm excused ;)
witzar
A: 

I try another answer to your first question (how to reduce the number of actions): An algorithmic approach could try to sort the actions. I think, that after sorting:

  1. The chance that neighbouring actions can be joined (in the manner Svante and peterchen showed), will rise.
  2. This may lead to the minimum number of actions that have to be performed?

In the following "position-number" stands for the text insertion or deletion position.

Assuming it is possible to swap two neighboring actions (by adjusting position-numbers and text/length property of this two actions), we can bring the action-list to any order we like. I suggest to bring the deletion actions to the front of the action list with ascending position-numbers. After the deletion actions the addition-actions are sorted with ascending position-numbers.

The following examples should demonstrate, why i think it is possible to swap any neighboring actions.

Swaping following actions:

  1. + 2 aaa -> taaaext
  2. - 3 1   -> taaext

yields to one action:

  1. + 2 aa  -> taaext

Swaping following actions:

  1. + 3 aaa -> teaaaxt
  2. + 1 bb  -> bbteaaaxt

yields to:

  1. + 1 bb  -> bbtext
  2. + 5 aaa -> bbteaaaxt

Swaping following actions:

  1. + 1 bb  -> bbtext
  2. - 2 2   -> bext

yields to:

  1. - 1 1   -> ext
  2. + 1 b   -> bext

As the first example shows, in some cases a swap causes the removal of a deletion. This is a benefiting side effect. This is also the matter why i suggest to move all deletions to the front.

I hope that i didn't forget something and considered all circumstances.

Christian Ammer