ansaurus

Question

Algorithm for rating the monotonicity of an array (i.e. judging the "sortedness" of an array)

Answer 1

+4 A:

This seems like a good candidate for ~~Levenshtein~~ Damerau–Levenshtein distance - the number of swaps needed to sort the array. This should be proportional to how far each item is from where it should be in a sorted array.

Here's a simple ruby algorithm that sums the squares of the distances. It seems a good measure of sortedness - the result gets smaller every time two out-of-order elements are swapped.

ap = a.sort
sum = 0
a.each_index{|i| j = ap.index(a[i])-i 
  sum += (j*j)
}
dist = sum/(a.size*a.size)

AShelly 2010-01-20 19:14:57

But that isn't what levenshtein distance is. levenshtein distance is the edit distance, the minimum number of edit operations (insert,delete, and substitute) to get from one sequence to the other.

nlucaroni 2010-01-20 19:40:40

The general approach is interesting, one could try to find out how many "swap 2 intervals from the sequence" operations are needed to sort the array. But I suspect, in practice that is very hard to calculate.

Doc Brown 2010-01-20 19:49:20

@Doc, again, swap distance isn't levenshtein distance.

nlucaroni 2010-01-20 22:37:51

Forgot that strict Levenshtein doesn't include transpositions - D-L does.

AShelly 2010-01-21 00:47:52

@nlucaroni: yes, I know that, but you misunderstood my comment - I meant it would be a possible approach to measure the sortedness by the number of swaps needed, just like 'Levenshtein' measures the number of inserts, deletes, substitutes. If this is also good approach? Honestly, I don't know.

Doc Brown 2010-01-21 06:36:40

@AShelly, yes that'll do. But my point was directed more toward the other events that this algorithm handles (insertions, deletions, and substitutions). I feel (and this is obviously subjective toward my current understanding of the issue), that those events do not make sense in this context, they aren't the events that turn one unsorted permutation into another.

nlucaroni 2010-01-21 17:11:18

Answer 2

+2 A:

Something like these? http://en.wikipedia.org/wiki/Rank_correlation

KennyTM 2010-01-20 19:16:40

Answer 3

+3 A:

I expect that the choice of function to use depends very strongly on what you intend to use it for. Based on your question, I would guess that you are using a genetic system to create a sorting program, and this is to be the ranking function. If that is the case, then speed of execution is crucial. Based on that, I bet your longest-sorted-subsequence algorithm would work pretty well. That sounds like it should define fitness pretty well.

Jeffrey L Whitledge 2010-01-20 19:27:51

Answer 4

+1 A:

I would suggest looking at the Pancake Problem and the reversal distance of the permutations. These algorithms are often used to find the distance between two permutations (the Identity and the permuted string). This distance measure should take into account more clumps of in order values, as well as reversals (monotonically decreasing instead of increasing subsequences). There are also approximations that are polynomial time[PDF].

It really all depends on what the number means and if this distance function makes sense in your context though.

nlucaroni 2010-01-20 19:38:59

By treating this as Pancake problem, if the array is sorted descending, there is just one 'flip' operation needed to sort it, so it will be seen as 'almost sorted'. I suspect that is not what the OP wants.

Doc Brown 2010-01-20 19:53:23

It is almost sorted. Besides, he only said monotonicity. Descending or Ascending, still, it shows an essence of sorted. I would say that 7654321 is more sorted then 4237516. It solves his 'clumping' issue.

nlucaroni 2010-01-20 21:25:45

Answer 5

A:

It highly depends on what you're intending to use the measure for, but one easy way to do this is to feed the array into a standard sorting algorithm and measure how many operations (swaps and/or comparisons) need to be done to sort the array.

Mark Bessey 2010-01-20 20:00:40

That will most likely give *very* different results according to the algorithm used.

Doc Brown 2010-01-20 20:54:50

That's true, of course - though any reasonably-smart sort algorithm like mergesort or quicksort will have generally-decreasing time for "more sorted" input.

Mark Bessey 2010-01-20 22:24:41

A naïve version of quicksort, in which the first element of each subrange is taken to be the partitioning element, will famously be O(n^2) for an already-sorted list, so you do have to be careful about this! According to Sedgewick, insertion sort is your best bet for a mostly-sorted list.

Jeffrey L Whitledge 2010-01-21 00:32:38

Answer 6

+1 A:

Here's one I just made up.

For each pair of adjacent values, calculate the numeric difference between them. If the second is greater than or equal to the first, add that to the sorted total, otherwise add to the unsorted total. When done, take the ratio of the two.

Mark Ransom 2010-01-20 20:08:22

Answer 7

+1 A:

Compute the lenghts of all sorted sub-sequences, then square them and add them. If you want to calibrate how much enphasis you put on largest, use a power different than 2.

I'm not sure what's the best way to normalize this by length, maybe divide it per length squared?

Emilio M Bumachar 2010-01-20 20:10:04

Answer 8

A:

Some experiments with a modifier Ratcliff & Obershelp

>>> from difflib import SequenceMatcher as sm
>>> a = [ 4, 5, 6, 0, 1, 2, 3, 7, 8, 9 ]
>>> c = [ 0, 1, 9, 2, 8, 3, 6, 4, 7, 5 ]
>>> b = [ 4, 5, 6, 0, 1, 2, 3, 7, 8, 9 ]
>>> b.sort()
>>> s = sm(None, a, b)
>>> s.ratio()
0.69999999999999996
>>> s2 = sm(None, c, b)
>>> s2.ratio()
0.29999999999999999

So kind of does what it needs to. Not too sure how to prove it though.

pulegium 2010-01-20 20:12:27

Answer 9

+2 A:

What you're probably looking for is Kendall Tau. It's a one-to-one function of the bubble sort distance between two arrays. To test whether an array is "almost sorted", compute its Kendall Tau against a sorted array.

dsimcha 2010-01-20 20:16:04

ansaurus

tags:

views:

answers:

Algorithm for rating the monotonicity of an array (i.e. judging the "sortedness" of an array)

related questions