I have two arrays containing the same elements, but in different orders, and I want to know the extent to which their orders differ.

The method I tried didn't work. It was as follows:

For each list I built a matrix which recorded for each pair of elements whether they were above or below each other in the list. I then calculated a pearson correlation coefficient of these two matrices. This worked extremely badly. Here's a trivial example:

list 1: 1 2 3 4

list 2: 1 3 2 4

The method I described above produced matrices like this (where 1 means the row's element appears before the column's element in the list, and 0 means it appears after):

list 1:

  1 2 3 4
1   1 1 1
2     1 1
3       1
4

list 2:

  1 2 3 4 
1   1 1 1
2     0 1 
3       1
4

Since the only difference is the order of elements 2 and 3, these should be deemed to be very similar. The Pearson Correlation Coefficient for those two matrices is 0, suggesting they are not correlated at all. I guess the problem is that what I'm looking for is not really a correlation coefficient, but some other kind of similarity measure. Edit distance, perhaps?

Can anyone suggest anything better?

Thanks,

Ben

+7  A: 

Mean square of differences of indices of each element.

List 1: A B C D E
List 2: A D C B E

Indices of each element of List 1 in List 2 (zero based)

A B C D E
0 3 2 1 4

Indices of each element of List 1 in List 1 (zero based)

A B C D E
0 1 2 3 4

Differences:

A  B C D E
0 -2 0 2 0

Square of differences:

A B C D E
0 4 0 4 0

Average differentness = 8 / 5.
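For the record, a minimal Python sketch of this measure (the function name is mine):

```python
def mean_squared_index_difference(list1, list2):
    """Average of squared differences between each element's
    position in list1 and its position in list2."""
    pos2 = {e: i for i, e in enumerate(list2)}
    return sum((i - pos2[e]) ** 2 for i, e in enumerate(list1)) / len(list1)

print(mean_squared_index_difference(list('ABCDE'), list('ADCBE')))  # 1.6
```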

jamesh
How does this work when the arrays are (a,b,c,d,e) vs. (e,a,b,c,d)? The arrays differ only by a single rotation. But the above algorithm will give a difference of 4 ((16+1+1+1+1)/5). Is this what the OP intends?
jmucchiello
That single rotation moved e to the opposite end of the list. Seems like a reasonable measure to me.
recursive
+1  A: 

You might consider how many changes it takes to transform one string into another (which I guess is what you were getting at when you mentioned edit distance).

See: http://en.wikipedia.org/wiki/Levenshtein_distance

Although I don't think Levenshtein distance takes rotation into account. If you allow rotation as an operation, then:

1, 2, 3, 4

and

2, 3, 4, 1

are pretty similar.
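A sketch of the standard dynamic-programming formulation in Python (insertions, deletions, and substitutions at unit cost; no rotation):

```python
def levenshtein(a, b):
    """Classic edit distance using a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein([1, 2, 3, 4], [2, 3, 4, 1]))  # 2
```

Note that the rotated pair scores 2 (delete the 1 from the front, insert it at the back), not 0.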

Dana
+1  A: 

Just an idea, but is there any mileage in adapting a standard sort algorithm to count the number of swap operations needed to transform list1 into list2?

I think that defining the compare function may be difficult though (perhaps even just as difficult as the original problem!), and this may be inefficient.

edit: thinking about this a bit more, the compare function would essentially be defined by the target list itself. So for example if list 2 is:

1 4 6 5 3

...then the compare function should result in 1 < 4 < 6 < 5 < 3 (and return equality where entries are equal).

Then the swap function just needs to be extended to count the swap operations.
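A possible Python sketch of this idea (names are mine): map each element to its rank in the target list, then bubble-sort the mapped keys while counting swaps.

```python
def swaps_to_transform(list1, list2):
    """Count bubble-sort swaps needed to reorder list1 into list2.
    The 'compare function' is each element's position in list2."""
    rank = {e: i for i, e in enumerate(list2)}
    keys = [rank[e] for e in list1]
    swaps = 0
    # plain bubble sort, counting swap operations
    for end in range(len(keys) - 1, 0, -1):
        for i in range(end):
            if keys[i] > keys[i + 1]:
                keys[i], keys[i + 1] = keys[i + 1], keys[i]
                swaps += 1
    return swaps

print(swaps_to_transform([1, 3, 4, 5, 6], [1, 4, 6, 5, 3]))  # 4
```

The swap count of a bubble sort equals the number of inversions between the two orders, which connects this suggestion to the inversion-counting answer below.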

frankodwyer
A: 

There is a branch-and-bound algorithm that should work for any set of operators you like. It may not be real fast. The pseudocode goes something like this:

bool bounded_recursive_compare_routine(int* a, int* b, int level, int bound){
    if (level > bound) return false;
    // if at end of a and b, return true
    // apply rule 0, like no-change
    if (*a == *b){
        if (bounded_recursive_compare_routine(a+1, b+1, level+0, bound))
            return true;
    }
    // if can apply rule 1, like rotation, to b, try that and recur
    if (bounded_recursive_compare_routine(a+1, b+1, level+cost_of_rotation, bound))
        return true;
    ...
    return false;
}

int get_minimum_cost(int* a, int* b){
    int bound;
    for (bound=0; ; bound++){
        if (bounded_recursive_compare_routine(a, b, 0, bound)) break;
    }
    return bound;
}

The time it takes is roughly exponential in the answer, because it is dominated by the last bound that works.
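A runnable Python version of the same iterative-deepening idea, using insert/delete/substitute as the rule set (my choice of operators, not the answer's):

```python
def bounded_compare(a, b, level, bound):
    """Depth-bounded search: can a be turned into b for cost <= bound?"""
    if level > bound:
        return False
    if not a and not b:
        return True
    if a and b and a[0] == b[0] and bounded_compare(a[1:], b[1:], level, bound):
        return True                                        # rule 0: no change
    if a and b and bounded_compare(a[1:], b[1:], level + 1, bound):
        return True                                        # substitution
    if a and bounded_compare(a[1:], b, level + 1, bound):
        return True                                        # deletion
    if b and bounded_compare(a, b[1:], level + 1, bound):
        return True                                        # insertion
    return False

def get_minimum_cost(a, b):
    bound = 0
    while not bounded_compare(a, b, 0, bound):
        bound += 1
    return bound

print(get_minimum_cost('1234', '1324'))  # 2
```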

Added: This can be extended to find the nearest-matching string stored in a trie. I did that years ago in a spelling-correction algorithm.

Mike Dunlavey
A: 

I'm not sure exactly what formula it uses under the hood, but difflib.SequenceMatcher.ratio() does exactly this:

ratio(self) method of difflib.SequenceMatcher instance:
    Return a measure of the sequences' similarity (float in [0,1]).

Code example:

from difflib import SequenceMatcher
sm = SequenceMatcher(None, '1234', '1324')
print(sm.ratio())
# 0.75
Deestan
A: 

Another approach, based on a little bit of mathematics, is to count the number of inversions needed to convert one of the arrays into the other. An inversion is the exchange of two neighboring array elements. In Ruby it is done like this:

# extend class array by new method
class Array
  def dist(other)
    raise 'can calculate distance only to array with same length' if length != other.length
    # initialize count of inversions to 0
    count = 0
    # loop over all pairs of indices i, j with i<j
    length.times do |i|
      (i+1).upto(length - 1) do |j|
        # increase count if i-th and j-th element have different order
        count += 1 if (self[i] <=> self[j]) != (other[i] <=> other[j])
      end
    end
    return count
  end
end
l1 = [1, 2, 3, 4]
l2 = [1, 3, 2, 4]
# try an example (prints 1)
puts l1.dist(l2)

The distance between two arrays of length n can be between 0 (they are the same) and n*(n-1)/2 (one array is the reverse of the other). If you prefer distances always between 0 and 1, so that you can compare pairs of arrays of different lengths, just divide by n*(n-1)/2.

A disadvantage of this algorithm is its O(n^2) running time. It also assumes that the arrays don't have duplicate entries, but it could be adapted to handle them.

A remark about the code line "count += 1 if ...": the count is increased only when the i-th and j-th elements appear in a different relative order in the two lists. In short: (l1[i] < l1[j] and l2[i] > l2[j]) or (l1[i] > l1[j] and l2[i] < l2[j]).

+1  A: 

A bit late for the party here, but just for the record, I think Ben almost had it... if you'd looked further into correlation coefficients, I think you'd have found that Spearman's rank correlation coefficient might have been the way to go.

Interestingly, jamesh seems to have derived a similar measure, but not normalized.

See this recent SO answer.
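For the small no-ties case here, Spearman's rho can be computed directly from the rank-difference formula; this is exactly jamesh's sum of squared index differences, normalized into [-1, 1]. A minimal sketch (function name is mine):

```python
def spearman_rho(list1, list2):
    """Spearman's rank correlation between two orderings of the same
    distinct elements (no ties): 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(list1)
    pos2 = {e: i for i, e in enumerate(list2)}
    d_squared = sum((i - pos2[e]) ** 2 for i, e in enumerate(list1))
    return 1 - 6 * d_squared / (n * (n * n - 1))

print(spearman_rho([1, 2, 3, 4], [1, 3, 2, 4]))  # 0.8
```

With ties, or for real data, a library routine such as scipy.stats.spearmanr is the safer route.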

bubaker