Hello.

I have the following problem.

I have a set of elements that I can sort by a certain algorithm A. The sorting is good, but very expensive.

There is also an algorithm B that can approximate the result of A. It is much faster, but the ordering will not be exactly the same.

Taking the output of A as a 'gold standard', I need to get a meaningful estimate of the error resulting from the use of B on the same data.

Could anyone please suggest any resource I could look at to solve my problem? Thanks in advance!

EDIT:

As requested, here is an example to illustrate the case: if the data are the first 10 letters of the alphabet,

A outputs : a,b,c,d,e,f,g,h,i,j

B outputs : a,b,d,c,e,g,h,f,j,i

What are the possible measures of the resulting error that would allow me to tune the internal parameters of algorithm B to get results closer to the output of A?

+1  A: 

You could try something involving Hamming distance.
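
For illustration, a minimal sketch of a position-wise Hamming distance between the two outputs (hypothetical helper name; assumes both outputs are permutations of the same elements):

def hamming_distance(out_A, out_B):
    # Count positions whose element differs from the gold-standard ordering
    return sum(a != b for a, b in zip(out_A, out_B))

For the example in the question, hamming_distance("abcdefghij", "abdceghfji") returns 7, since seven positions hold a different element than in A's output.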

yx
I don't think Hamming distance is a good solution for this problem. It offers an element-by-element comparison, but the distance between two elements says nothing about sorting quality.
Ronald Wildenberg
You are right; I did not say to use only Hamming distance, just something that involves it. If he wants a more expensive estimate, he should use distance calculations.
yx
+3  A: 

Are you looking for an algorithm that calculates the difference, taking the array sorted with A and the array sorted with B as inputs? Or are you looking for a generic method of determining, on average, how far off an array would be when sorted with B?

If the first, then I suggest something as simple as the distance of each item from where it should be (an average would do better than a sum, to remove the length of the array as an issue); see the sketch below.

If the second, then I think I'd need to see more about these algorithms.
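
As a sketch of the first option (hypothetical helper name; assumes both outputs contain the same elements):

def mean_displacement(out_A, out_B):
    # Position of each element in the gold-standard ordering
    pos = {x: i for i, x in enumerate(out_A)}
    # Average absolute distance of each element from its gold position
    return sum(abs(pos[x] - i) for i, x in enumerate(out_B)) / len(out_B)

For the question's example, mean_displacement("abcdefghij", "abdceghfji") gives 0.8. Summing instead of averaging yields what is known as Spearman's footrule.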

Ed Marty
This isn't good enough: if, for example, the list is z, a, b, c, d…, the whole list is shifted by 1.
Georg
+2  A: 

Calculating the RMS error is one of many possible methods. Here is a small piece of Python code.

def calc_error(out_A, out_B):
    # out_A <= output of algorithm A (the gold standard)
    # out_B <= output of algorithm B (the approximation)
    rms_error = 0
    for i in range(len(out_A)):
        # Square the difference at each position and accumulate
        rms_error += (out_A[i] - out_B[i]) ** 2
    return rms_error ** 0.5  # Take the square root

>>> calc_error([1,2,3,4,5,6], [1,2,3,4,5,6])
0.0
>>> calc_error([1,2,3,4,5,6], [1,2,4,3,5,6])  # 4,3 swapped
1.4142135623730951
>>> calc_error([1,2,3,4,5,6], [1,2,4,6,3,5])  # 3,4,5,6 randomized
3.1622776601683795

NOTE: Taking the square root is not necessary, but squaring is, since the raw differences may sum to zero. I think the calc_error function gives an approximate count of wrongly placed pairs, but I don't have any programming tools handy to verify. :(

Take a look at this question.

TheMachineCharmer
I was thinking about RMSE too. But the original question says "sorting is expensive", so I have to assume that the error metric must be calculated without ever having a canonical sorting to compare against. And without the canonical order, you can't compute RMSE.
benjismith
No, the OP has access to the gold standard for training purposes. He wants an error function so he can optimize his approximate sorter before turning it loose.
John Fouhy
+4  A: 

I would determine the largest correctly ordered subset.

                               +-------------> I
                               |   +--------->
                               |   |
A -> B -> D ----->  E  -> G -> H --|--> J
     |             ^ |             |    ^
     |             | |             |    |
     +------> C ---+ +-----------> F ---+

In your example, the largest correctly ordered subset contains 7 out of 10 elements, so the algorithm scores 0.7. The other correctly ordered subsets have length 6. A correct ordering scores 1.0, a reverse ordering 1/n.

I assume this is related to the number of inversions. Below, x + y indicates x <= y (correct order) and x - y indicates x > y (wrong order) for adjacent elements of B's output.

A + B + D - C + E + G + H - F + J - I

We obtain almost the same result: 6 of 9 adjacent pairs are correct, scoring 0.667. Again, a correct ordering scores 1.0 and a reverse ordering 0.0, and this might be much easier to calculate.
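
A minimal sketch of both scores (hypothetical helper names; assumes the outputs are permutations of the same elements):

from bisect import bisect_left

def ordered_subset_score(out_A, out_B):
    # Rank of each element in the gold-standard ordering
    rank = {x: i for i, x in enumerate(out_A)}
    seq = [rank[x] for x in out_B]
    # Patience-sorting computation of the longest increasing subsequence
    tails = []
    for r in seq:
        i = bisect_left(tails, r)
        if i == len(tails):
            tails.append(r)
        else:
            tails[i] = r
    return len(tails) / len(seq)

def adjacent_pair_score(out_A, out_B):
    rank = {x: i for i, x in enumerate(out_A)}
    seq = [rank[x] for x in out_B]
    # Fraction of adjacent pairs that are in the correct order
    return sum(a < b for a, b in zip(seq, seq[1:])) / (len(seq) - 1)

For the question's example, ordered_subset_score("abcdefghij", "abdceghfji") gives 0.7 and adjacent_pair_score gives about 0.667, matching the counts above.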

Daniel Brückner
+2  A: 
bubaker
Yes, this looks very much like the thing I was looking for. Thanks!
+2  A: 

It's tough to give a good generic answer, because the right solution for you will depend on your application.

One of my favorite options is just the number of in-order adjacent element pairs, divided by the total number of adjacent pairs. This is a nice, simple, easy-to-compute metric that just tells you how many mistakes there are. But it doesn't make any attempt to quantify the magnitude of those mistakes.

double sortQuality = 1;
if (array.length > 1) {
   int inOrderPairCount = 0;
   for (int i = 1; i < array.length; i++) {
      // Count adjacent pairs that are already in order
      if (array[i] >= array[i - 1]) ++inOrderPairCount;
   }
   // Fraction of adjacent pairs in order; 1.0 means fully sorted
   sortQuality = (double) inOrderPairCount / (array.length - 1);
}
benjismith