tags:

views:

149

answers:

5

I need an algorithm which calculates something similar to the Levenshtein distance. It should calculate the minimal possible number of transformations to get from one sequence to another. The allowed transformations are deletion, insertion and move:

Deletion of 3: before {1, 2, 3, 4}, after {1, 2, 4}

Insertion of 5: before {1, 2, 3, 4}, after {1, 2, 5, 3, 4}

Move of 4: before {1, 2, 3, 4}, after {4, 1, 2, 3}

So the algorithm should calculate the minimal number of such transformations given the starting and ending sequence of numbers and, ideally, give the list of the transformations needed. One important trait of the sequences is that number are never duplicated.

I have an idea that I can modify the Levenshtein algorithm so it only counts the number of deletions and insertions and ignore substitutions. And the number of moves would be number of deletions plus number of insertions divided by 2. But I'm not sure if it's correct.

Does anybody know such an algorithm?

EDIT:

I probably should have said, that this algorithm will work on a sequence of sequences. For example:

{ {1, 2, 3}, {4}, {5, 6, 7} }

where the numbers are not duplicated. The total number of the inner elements in the sequence is not changed. Elements can migrate from one inner sequence to another inner sequence. The number of inner sequences is also constant. So it could be

{ {1, 2, 3, 4, 5, 6, 7}, {}, {} }
{ {}, {1, 2, 3, 4, 5, 6, 7}, {} }
{ {}, {}, {1, 2, 3, 4, 5, 6, 7} }

The idea is to calculate the distance for each corresponding inner sequence and then sum them to get a distance for the outer sequence.
So the elements can be deleted or inserted to the inner sequence, but they never disappear from the outer sequence.

The most important aspect of this algorithm is that it should find MINIMAL number of transformations. Not just some number of transofrmations.

A: 

Consider:

abc

which is transformed to:

abcdef

This is 3 insertions, 0 deletions, and 0 moves. But, your algorithm would give 3 insertions, 0 deletions, and 1.5 moves.

carl
Actually it would give 3 as an answer. Which can be interpreted as 3 insertions, 3 deletions or 1.5 moves. The number that returns the algorithm is a distance between sequences, so it's a measure of how far they are from each other. Keeping that in mind what I did is that I assigned different costs for different operations. 1 for deletion and insertion and 2 for move.
Max
+1  A: 

You could keep track of the deleted and inserted numbers and in the end calculate the number of moves by inspecting these two sets to find numbers that were deleted and inserted again:

moved = intersection(deleted, inserted)
moves = sizeof(moved)
deletions = sizeof(deleted) - sizeof(moved)
insertions = sizeof(inserted) - sizeof(moved)
sth
A: 

Let's see if I understood your problem correctly: you have given a sequence X of (possibly empty) sequences where each number from 1 to n is contained in exactly one (inner) sequence. Using just two types of operations you have to transform X to another given sequence Y of the same kind where you want to minimize the number of operations needed in the transformation.

The first operation is to move a number from any position to any other position within the same inner sequence. The second operation is to move a number from any position in one inner sequence to any position in another inner sequence.

Is this your problem? If yes, read on.

A generic approach to solve this problem is to consider the unordered graph given by taking all sequences of the same kind as X and Y as set of vertices. Two vertices are connected by an edge in the graph if and only if they can be transformed into each other using exactly one of the two allowed operations (this relation is symmetric since you can undo an allowed operation by another allowed operation). Now use a shortest-path-finding algorithm like breadth first search where you dynamically generate the neighbors.

As the number of reachable vertices grows roughly exponential in the number of steps taken, in practice you should run the algorithm alternately from starting point X and from starting point Y. You switch the starting point whenever all vertices reachable in 1, 2, 3, ... steps is determined and you stop when the sets of reachable points from X rsp. Y intersect.

The running time and memory usage for this generic algorithm will be too high for even moderate n, but it can be optimized tremendously using two observations:

  • The elements that are contained already in the correct sequence from the beginning can be resorted using the first operation independently from all other elements (i.e., those contained in other inner sequences or those that have to be moved to another inner sequence).

  • Elements moved from one inner sequence to another by the second operation can be put directly into a correct position in the correct inner sequence so that they don't have to be moved again by another operation.

As the sorting within the inner sequences and moving of elements between the inner sequences are all independent, the total number of operations is the sum of the minimal numbers to sort the inner sequences (only the elements that are in the same inner sequence for X and Y have to be considered) plus the number of elements that have to be moved from one inner sequence to another.

To determine the number of (first) operations to sort the elements already contained correctly in an inner sequence just ignore (cancel) the elements that have to be moved to another inner sequence. Then use the generic approach described above for the graph with vertices the set of sequences consisting of the elements not to be moved. Two vertices exists in the graph are connected if moving an element (i.e., applying the first operation) transforms one vertex (= sequence) into the other.

Whoever
A: 
Beta
+1  A: 

Since your lists contain only unique elements it is clear which ones you have to remove and which ones you have to insert. What remains is the problem to find the smallest number of moves that is necessary starting from one list to get another list that contains the same elements. It is always possible to relabel the elements in the two lists such that the elements in the staring list are increasing.

E.g. we might get the problem to find the smallest number of moves starting from the list

[1,2,3,4,5,6,7]

that gives the list

[3,5,1,2,4,7,6].

To solve this problem one can use a little trick. Rather than trying to find the smallest number of elements to move, it is easier to find the largest number of elements that need not be moved. These must be elements in the second list that are in increasing order. This is the longest increasing subsequence problem. In the example above 1,2,4,7 would for example be a maximal subset. Hence a minimal set of elements to move would be {3,5,6}.

Accipitridae
This seems to be correct. Great trick! Thanks a lot!!!
Max