views:

603

answers:

6

Most sort algorithms rely on a pairwise-comparison the determines whether A < B, A = B or A > B.

I'm looking for algorithms (and for bonus points, code in Python) that take advantage of a pairwise-comparison function that can distinguish a lot less from a little less or a lot more from a little more. So perhaps instead of returning {-1, 0, 1} the comparison function returns {-2, -1, 0, 1, 2} or {-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5} or even a real number on the interval (-1, 1).

For some applications (such as near sorting or approximate sorting) this would enable a reasonable sort to be determined with less comparisons.

+5  A: 

You can use a modified quick sort. Let me explain on an example when you comparison function returns [-2, -1, 0, 1, 2]. Say, you have an array A to sort.

Create 5 empty arrays - Aminus2, Aminus1, A0, Aplus1, Aplus2.

Pick an arbitrary element of A, X.

For each element of the array, compare it with X.

Depending on the result, place the element in one of the Aminus2, Aminus1, A0, Aplus1, Aplus2 arrays.

Apply the same sort recursively to Aminus2, Aminus1, Aplus1, Aplus2 (note: you don't need to sort A0, as all he elements there are equal X).

Concatenate the arrays to get the final result: A = Aminus2 + Aminus1 + A0 + Aplus1 + Aplus2.

Igor Krivokon
So in a lovely, equal problem spread world (equal hits to -2..+2 buckets) this would be a n log^4 n solution to sorting rather than a n log^2 n solution
Tom Leys
@Tom, thats the same complexity, the log base is just like a constant multiplier.
wowest
Also, you mean log_4 n (log to base 4), not log^4 n (which means log-n to the fourth power).
ShreevatsaR
+1  A: 

It seems like using raindog's modified quicksort would let you stream out results sooner and perhaps page into them faster.

Maybe those features are already available from a carefully-controlled qsort operation? I haven't thought much about it.

This also sounds kind of like radix sort except instead of looking at each digit (or other kind of bucket rule), you're making up buckets from the rich comparisons. I have a hard time thinking of a case where rich comparisons are available but digits (or something like them) aren't.

the particular application I have in mind is where humans are actually (subjectively) providing the pair-wise comparison
James Tauber
An interesting application. So in theory you are trying to reduce the number of comparisons to the minimum possible.
Tom Leys
Tom, yes, reduce the number of comparisons at the expense of being only a near sort
James Tauber
+1  A: 

I can't think of any situation in which this would be really useful. Even if I could, I suspect the added CPU cycles needed to sort fuzzy values would be more than those "extra comparisons" you allude to. But I'll still offer a suggestion.

Consider this possibility (all strings use the 27 characters a-z and _):

            11111111112
   12345678901234567890
1/ now_is_the_time
2/ now_is_never
3/ now_we_have_to_go
4/ aaa
5/ ___

Obviously strings 1 and 2 are more similar that 1 and 3 and much more similar than 1 and 4.

One approach is to scale the difference value for each identical character position and use the first different character to set the last position.

Putting aside signs for the moment, comparing string 1 with 2, the differ in position 8 by 'n' - 't'. That's a difference of 6. In order to turn that into a single digit 1-9, we use the formula:

digit = ceiling(9 * abs(diff) / 27)

since the maximum difference is 26. The minimum difference of 1 becomes the digit 1. The maximum difference of 26 becomes the digit 9. Our difference of 6 becomes 3.

And because the difference is in position 8, out comparison function will return 3x10-8 (actually it will return the negative of that since string 1 comes after string 2.

Using a similar process for strings 1 and 4, the comparison function returns -5x10-1. The highest possible return (strings 4 and 5) has a difference in position 1 of '-' - 'a' (26) which generates the digit 9 and hence gives us 9x10-1.

Take these suggestions and use them as you see fit. I'd be interested in knowing how your fuzzy comparison code ends up working out.

paxdiablo
+1  A: 

Considering you are looking to order a number of items based on human comparison you might want to approach this problem like a sports tournament. You might allow each human vote to increase the score of the winner by 3 and decrease the looser by 3, +2 and -2, +1 and -1 or just 0 0 for a draw.

Then you just do a regular sort based on the scores.

Another alternative would be a single or double elimination tournament structure.

Tom Leys
I've considered doing a near-sorting first as a way of seeding a tournament structure
James Tauber
A: 

You can use two comparisons, to achieve this. Multiply the more important comparison by 2, and add them together.

Here is a example of what I mean in Perl. It compares two array references by the first element, then by the second element.

use strict;
use warnings;
use 5.010;

my @array = (
  [a => 2],
  [b => 1],
  [a => 1],
  [c => 0]
);

say "$_->[0] => $_->[1]" for sort {
  ($a->[0] cmp $b->[0]) * 2 +
  ($a->[1] <=> $b->[1]);
} @array;
a => 1
a => 2
b => 1
c => 0

You could extend this to any number of comparisons very easily.

Brad Gilbert
A: 

Perhaps there's a good reason to do this but I don't think it beats the alternatives for any given situation and certainly isn't good for general cases. The reason? Unless you know something about the domain of the input data and about the distribution of values you can't really improve over, say, quicksort. And if you do know those things, there are often ways that would be much more effective.

Anti-example: suppose your comparison returns a value of "huge difference" for numbers differing by more than 1000, and that the input is {0, 10000, 20000, 30000, ...}

Anti-example: same as above but with input {0, 10000, 10001, 10002, 20000, 20001, ...}

But, you say, I know my inputs don't look like that! Well, in that case tell us what your inputs really look like, in detail. Then someone might be able to really help.

For instance, once I needed to sort historical data. The data was kept sorted. When new data were added it was appended, then the list was run again. I did not have the information of where the new data was appended. I designed a hybrid sort for this situation that handily beat qsort and others by picking a sort that was quick on already sorted data and tweaking it to be fast (essentially switching to qsort) when it encountered unsorted data.

The only way you're going to improve over the general purpose sorts is to know your data. And if you want answers you're going to have to communicate that here very well.

dwc
the task is a human subjectively expressing their preference for items in a collection in a pairwise fashion in order to be able to near-sort that collection by the person's preference
James Tauber