EDIT: Wow, many great responses. Yes, I am using this as a fitness function for judging the quality of a sort performed by a genetic algorithm, so cost of evaluation is important (i.e., it has to be fast, preferably O(n)).


As part of an AI application I am toying with, I'd like to be able to rate a candidate array of integers based on its monotonicity, aka its "sortedness". At the moment, I'm using a heuristic that calculates the longest sorted run, and then divides that by the length of the array:

public double monotonicity(int[] array) {
    if (array.length == 0) return 1d;

    int longestRun = longestSortedRun(array);
    return (double) longestRun / (double) array.length;
}

public int longestSortedRun(int[] array) {

    if (array.length == 0) return 0;

    int longestRun = 1;
    int currentRun = 1;

    for (int i = 1; i < array.length; i++) {
        if (array[i] >= array[i - 1]) {
            currentRun++;
        } else {
            currentRun = 1;
        }

        if (currentRun > longestRun) longestRun = currentRun;
    }

    return longestRun;
}

This is a good start, but it fails to take into account the possibility that there may be "clumps" of sorted sub-sequences. E.g.:

{ 4, 5, 6, 0, 1, 2, 3, 7, 8, 9 }

This array is partitioned into three sorted sub-sequences. My algorithm rates it at only 70% sorted (the longest non-decreasing run being { 0, 1, 2, 3, 7, 8, 9 }), but intuitively it should score higher than that, since it is just three sorted blocks spliced together. Is there a standard algorithm for this sort of thing?

+4  A: 

This seems like a good candidate for Damerau–Levenshtein distance - the number of swaps needed to sort the array. This should be proportional to how far each item is from where it should be in a sorted array.

Here's a simple Ruby algorithm that sums the squares of each element's distance from its sorted position. It seems a good measure of sortedness - the result gets smaller every time two out-of-order elements are swapped.

sorted = a.sort
sum = 0
a.each_index do |i|
  # squared distance of each element from its sorted position
  # (assumes distinct values; Array#index finds the first match)
  j = sorted.index(a[i]) - i
  sum += j * j
end
# float division, so the result isn't truncated to an integer
dist = sum.to_f / (a.size * a.size)
AShelly
But that isn't what Levenshtein distance is. Levenshtein distance is the edit distance - the minimum number of edit operations (insert, delete, and substitute) needed to turn one sequence into the other.
nlucaroni
The general approach is interesting - one could try to find out how many "swap two intervals of the sequence" operations are needed to sort the array. But I suspect that, in practice, that is very hard to calculate.
Doc Brown
@Doc, again, swap distance isn't levenshtein distance.
nlucaroni
Forgot that strict Levenshtein doesn't include transpositions - D-L does.
AShelly
@nlucaroni: yes, I know that, but you misunderstood my comment - I meant it would be a possible approach to measure the sortedness by the number of swaps needed, just like Levenshtein measures the number of inserts, deletes, and substitutes. Is this also a good approach? Honestly, I don't know.
Doc Brown
@AShelly, yes, that'll do. But my point was directed more toward the other operations this algorithm handles (insertions, deletions, and substitutions). I feel (and this is obviously subjective, given my current understanding of the issue) that those operations don't make sense in this context - they aren't the events that turn one unsorted permutation into another.
nlucaroni
+2  A: 

Something like these? http://en.wikipedia.org/wiki/Rank_correlation
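
For instance, Spearman's rho between each element's current position and its rank in the sorted order. A rough Java sketch (class and method names are just for illustration; it assumes distinct values, since ties would need average ranks):

import java.util.Arrays;

public class SpearmanSketch {
    // Spearman's rho between each element's position and its sorted rank:
    // 1.0 for a sorted array, -1.0 for a reversed one.
    public static double spearman(int[] array) {
        int n = array.length;
        if (n < 2) return 1.0;

        int[] sorted = array.clone();
        Arrays.sort(sorted);

        long sumSqDiff = 0;
        for (int i = 0; i < n; i++) {
            // Rank of array[i] in the sorted order (distinct values assumed).
            int rank = Arrays.binarySearch(sorted, array[i]);
            long d = (long) rank - i;
            sumSqDiff += d * d;
        }
        // Closed form for distinct ranks: rho = 1 - 6*sum(d^2) / (n^3 - n).
        return 1.0 - (6.0 * sumSqDiff) / ((double) n * n * n - n);
    }
}

A sorted array scores 1.0 and a reversed one -1.0; rescaling to [0, 1] is easy if needed.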

KennyTM
+3  A: 

I expect that the choice of function depends very strongly on what you intend to use it for. Based on your question, I would guess that you are using a genetic system to evolve a sorting program, and that this is to be the ranking function. If so, speed of execution is crucial. Based on that, I bet your longest-sorted-run algorithm would work pretty well - it sounds like it should define fitness nicely.

Jeffrey L Whitledge
+1  A: 

I would suggest looking at the Pancake Problem and the reversal distance of permutations. These algorithms are often used to find the distance between two permutations (here, the identity and the permuted sequence). This distance measure takes into account clumps of in-order values, as well as reversals (monotonically decreasing rather than increasing subsequences). There are also polynomial-time approximations [PDF].

It really all depends on what the number means and if this distance function makes sense in your context though.
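
One cheap proxy along these lines is the breakpoint count, a standard lower bound for the reversal distance: replace values by their ranks, then count adjacent pairs whose ranks are not consecutive in either direction. A rough Java sketch, assuming distinct values:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class BreakpointSketch {
    // Score = fraction of adjacent pairs that are NOT breakpoints, where a
    // breakpoint is a pair whose ranks are not consecutive either way.
    public static double breakpointScore(int[] array) {
        int n = array.length;
        if (n < 2) return 1.0;

        // Replace each value by its rank in sorted order.
        int[] sorted = array.clone();
        Arrays.sort(sorted);
        Map<Integer, Integer> rank = new HashMap<>();
        for (int i = 0; i < n; i++) rank.put(sorted[i], i);

        int breakpoints = 0;
        for (int i = 1; i < n; i++) {
            int diff = rank.get(array[i]) - rank.get(array[i - 1]);
            if (diff != 1 && diff != -1) breakpoints++; // ranks not adjacent
        }
        return 1.0 - (double) breakpoints / (n - 1);
    }
}

On the question's example { 4, 5, 6, 0, 1, 2, 3, 7, 8, 9 } there are only two breakpoints, so the score is 1 - 2/9, about 0.78, and a fully descending array scores 1.0 - consistent with the comments below.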

nlucaroni
By treating this as a Pancake Problem, if the array is sorted descending there is just one 'flip' operation needed to sort it, so it will be seen as 'almost sorted'. I suspect that is not what the OP wants.
Doc Brown
It is almost sorted. Besides, he only said monotonicity - descending or ascending, it still shows an essence of sortedness. I would say that 7654321 is more sorted than 4237516. It also solves his 'clumping' issue.
nlucaroni
A: 

It highly depends on what you're intending to use the measure for, but one easy way to do this is to feed the array into a standard sorting algorithm and measure how many operations (swaps and/or comparisons) need to be done to sort the array.
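
A rough sketch of this in the question's Java, instrumenting an insertion sort (which, as a comment below notes, is well suited to mostly-sorted input). Counting its element shifts gives exactly the inversion count; note the worst case is O(n^2), not the O(n) the questioner is hoping for:

public class SortCostSketch {
    // Count the element shifts an insertion sort performs, then normalize
    // by the worst case n*(n-1)/2: 1.0 = already sorted, 0.0 = reversed.
    public static double sortedness(int[] array) {
        int[] a = array.clone(); // work on a copy, don't mutate the input
        long shifts = 0;
        for (int i = 1; i < a.length; i++) {
            int key = a[i];
            int j = i - 1;
            while (j >= 0 && a[j] > key) {
                a[j + 1] = a[j]; // shift a larger element one slot right
                j--;
                shifts++;
            }
            a[j + 1] = key;
        }
        long worstCase = (long) a.length * (a.length - 1) / 2;
        return worstCase == 0 ? 1.0 : 1.0 - (double) shifts / worstCase;
    }
}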

Mark Bessey
That will most likely give *very* different results according to the algorithm used.
Doc Brown
That's true, of course - though any reasonably smart sorting algorithm like mergesort or quicksort will generally take less time on "more sorted" input.
Mark Bessey
A naïve version of quicksort, in which the first element of each subrange is taken to be the partitioning element, will famously be O(n^2) for an already-sorted list, so you do have to be careful about this! According to Sedgewick, insertion sort is your best bet for a mostly-sorted list.
Jeffrey L Whitledge
+1  A: 

Here's one I just made up.

For each pair of adjacent values, calculate the numeric difference between them. If the second is greater than or equal to the first, add the difference to the sorted total; otherwise add its absolute value to the unsorted total. When done, take the ratio of the two.
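
A small Java sketch of this idea; reporting up / (up + down) instead of a raw ratio is one arbitrary choice to keep the result well-defined (and in [0, 1]) when there are no out-of-order pairs:

public class DeltaRatioSketch {
    // Sum ascending vs. descending difference magnitudes over adjacent
    // pairs: 1.0 when fully non-decreasing, 0.0 when decreasing everywhere.
    public static double deltaRatio(int[] array) {
        long up = 0, down = 0;
        for (int i = 1; i < array.length; i++) {
            int diff = array[i] - array[i - 1];
            if (diff >= 0) up += diff;   // "sorted" total
            else down -= diff;           // "unsorted" total (absolute value)
        }
        long total = up + down;
        return total == 0 ? 1.0 : (double) up / total;
    }
}

This runs in O(n), which fits the speed requirement in the question's edit.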

Mark Ransom
+1  A: 

Compute the lengths of all sorted sub-sequences, then square them and add them up. If you want to calibrate how much emphasis you put on the largest, use a power other than 2.

I'm not sure of the best way to normalize this by length - maybe divide by the length squared?
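
A short Java sketch with power 2, normalized by n squared (the value a single run spanning the whole array would get):

public class RunScoreSketch {
    // Sum the squared lengths of maximal non-decreasing runs, then
    // normalize by n^2 so a fully sorted array scores 1.0.
    public static double runScore(int[] array) {
        int n = array.length;
        if (n == 0) return 1.0;

        long sumOfSquares = 0;
        int runLength = 1;
        for (int i = 1; i < n; i++) {
            if (array[i] >= array[i - 1]) {
                runLength++; // run continues
            } else {
                sumOfSquares += (long) runLength * runLength;
                runLength = 1; // run broken, start a new one
            }
        }
        sumOfSquares += (long) runLength * runLength; // close the final run
        return (double) sumOfSquares / ((double) n * n);
    }
}

On the question's example the score is (3^2 + 7^2) / 10^2 = 0.58, and the whole thing is O(n).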

Emilio M Bumachar
A: 

Some experiments with a modified Ratcliff & Obershelp similarity (Python's difflib.SequenceMatcher):

>>> from difflib import SequenceMatcher as sm
>>> a = [ 4, 5, 6, 0, 1, 2, 3, 7, 8, 9 ]
>>> c = [ 0, 1, 9, 2, 8, 3, 6, 4, 7, 5 ]
>>> b = [ 4, 5, 6, 0, 1, 2, 3, 7, 8, 9 ]
>>> b.sort()
>>> s = sm(None, a, b)
>>> s.ratio()
0.69999999999999996
>>> s2 = sm(None, c, b)
>>> s2.ratio()
0.29999999999999999

So it kind of does what it needs to. I'm not too sure how to prove it, though.

pulegium
+2  A: 

What you're probably looking for is Kendall's tau. It's a one-to-one function of the bubble-sort distance between two arrays. To test whether an array is "almost sorted", compute its Kendall tau against a sorted copy of itself.
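
A rough Java sketch of that computation: the Kendall tau distance to the sorted order is exactly the array's inversion count, which a merge-sort pass finds in O(n log n) (class and method names are just for illustration):

import java.util.Arrays;

public class KendallTauSketch {
    // Normalize the inversion count by its maximum n*(n-1)/2, so the
    // score is in [0, 1] with 1.0 meaning fully sorted.
    public static double tauScore(int[] array) {
        long maxInv = (long) array.length * (array.length - 1) / 2;
        if (maxInv == 0) return 1.0;
        return 1.0 - (double) countInversions(array.clone()) / maxInv;
    }

    // Merge sort that sorts its argument in place and returns the number
    // of inversions it removed along the way.
    private static long countInversions(int[] a) {
        if (a.length < 2) return 0;
        int mid = a.length / 2;
        int[] left = Arrays.copyOfRange(a, 0, mid);
        int[] right = Arrays.copyOfRange(a, mid, a.length);
        long inv = countInversions(left) + countInversions(right);
        // Merge, counting pairs (i in left, j in right) with left[i] > right[j].
        int i = 0, j = 0, k = 0;
        while (i < left.length && j < right.length) {
            if (left[i] <= right[j]) {
                a[k++] = left[i++];
            } else {
                inv += left.length - i; // every remaining left element is larger
                a[k++] = right[j++];
            }
        }
        while (i < left.length) a[k++] = left[i++];
        while (j < right.length) a[k++] = right[j++];
        return inv;
    }
}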

dsimcha