views: 466
answers: 10

I want to sort items where the comparison is performed by humans:

  • Pictures
  • Priority of work items
  • ...

For these tasks the number of comparisons is the limiting factor for performance.

  • What is the minimum number of comparisons needed (I assume > N for N items)?
  • Which algorithm guarantees this minimum number?
+1  A: 

I don't think you're likely to get a better answer than the Wikipedia page on sorting.

Summary:

  • For arbitrary comparisons (where you can't use something like radix sorting) the best you can achieve is O(n log n)
  • Various algorithms achieve this - see the "comparison of algorithms" section.
  • The commonly used QuickSort is O(n log n) in a typical case, but O(n^2) in the worst case; there are often ways to avoid this, but if you're really worried about the cost of comparisons, I'd go with something like MergeSort or a HeapSort. It partly depends on your existing data structures.

If humans are doing the comparisons, are they also doing the sorting? Do you have a fixed data structure you need to use, or could you effectively create a copy using a balanced binary tree insertion sort? What are the storage requirements?

Jon Skeet
O(n log n) is only the best general-purpose sort. There are a few sorts, such as pigeonholing, that are O(n), albeit limited to specific types of data.
Shane MacLaughlin
Hence the "For arbitrary comparisons" part of my first point.
Jon Skeet
Fair enough, but if you have human interaction at every comparison based on recognising images, I would doubt the applicability of many arbitrary methods. Many manual sorts, e.g. filing, aim for O(n) even if they fail to achieve it. As you have asked, we need to know more about the specifics of the problem to give a good answer.
Shane MacLaughlin
Yup - it's definitely a case where the details could make a huge difference.
Jon Skeet
+4  A: 

Pigeon hole sorting is order N and works well with humans if the data can be pigeon holed. A good example would be counting votes in an election.
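A minimal sketch of the idea in Python, assuming the items map onto a small, known set of holes (here, votes for named candidates); note that no pairwise comparisons are made at all:

    from collections import defaultdict

    def pigeonhole_sort(ballots, candidates):
        """Drop each item into its pre-defined hole in one O(n) pass."""
        holes = defaultdict(list)
        for ballot in ballots:
            holes[ballot].append(ballot)
        # Read the holes back out in the fixed candidate order.
        return [b for c in candidates for b in holes[c]]

    # Counting votes this way never asks whether one ballot "beats" another.
    votes = ["Bob", "Alice", "Alice", "Carol", "Bob", "Alice"]
    print(pigeonhole_sort(votes, ["Alice", "Bob", "Carol"]))
    # -> ['Alice', 'Alice', 'Alice', 'Bob', 'Bob', 'Carol']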

Shane MacLaughlin
A: 

The best choice here would be merge sort.

The minimum run time is n*log(n) [base 2]. It works as follows (a short code sketch follows the steps):

If the list is of length 0 or 1, then it is already sorted.

Otherwise:

  • Divide the unsorted list into two sublists of about half the size.
  • Sort each sublist recursively by re-applying merge sort.
  • Merge the two sublists back into one sorted list.
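A direct Python sketch of those steps; the compare callback is a stand-in for whoever (or whatever) performs the comparison:

    def merge_sort(items, compare):
        """Sort items using compare(a, b) -> True when a should come before b."""
        if len(items) <= 1:                      # length 0 or 1: already sorted
            return list(items)
        mid = len(items) // 2
        left = merge_sort(items[:mid], compare)  # sort each half recursively
        right = merge_sort(items[mid:], compare)
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):  # merge the two sorted halves
            if compare(left[i], right[j]):
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
        return merged + left[i:] + right[j:]

    # With an ordinary comparison standing in for the human:
    print(merge_sort([5, 2, 9, 1, 7], lambda a, b: a <= b))   # [1, 2, 5, 7, 9]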

Jaelebi
+1  A: 

Here is a comparison of algorithms. The two best candidates are Quick Sort and Merge Sort. Quick Sort is in general better, but has worse worst-case performance.

kgiannakakis
+1 agreed... I usually use a combination of quicksort (for large sets) and mergesort (for small sets) myself, though I never tried to figure out if it was the optimal way to go.
David Zaslavsky
+3  A: 

You should consider that humans might make non-transitive comparisons, e.g. they favor A over B, B over C but also C over A. So when choosing your sort algorithm, make sure it doesn't completely break when that happens.
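One cheap safeguard is to check the collected answers for preference cycles before trusting them as a total order. A rough sketch, assuming the answers are recorded as (winner, loser) pairs:

    def has_preference_cycle(prefs):
        """prefs: iterable of (winner, loser) pairs. True if they contain a cycle
        such as A > B, B > C, C > A."""
        graph = {}
        for winner, loser in prefs:
            graph.setdefault(winner, set()).add(loser)
            graph.setdefault(loser, set())
        state = {}                      # missing = unvisited, 1 = in progress, 2 = done
        def visit(node):
            if state.get(node) == 1:    # reached a node already on the current path
                return True
            if state.get(node) == 2:
                return False
            state[node] = 1
            found = any(visit(nxt) for nxt in graph[node])
            state[node] = 2
            return found
        return any(visit(node) for node in graph)

    print(has_preference_cycle([("A", "B"), ("B", "C"), ("C", "A")]))   # True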

ammoQ
+3  A: 

People are really good at ordering 5-10 things from best to worst, and they come up with more consistent results when doing so. I think trying to apply a classical sorting algorithm might not work well here, because humans tend to compare several items at once rather than two at a time.

I'd argue that you should have a round robin type approach and try to bucket things into their most consistent groups each time. Each iteration would only make the result more certain.

It'd be interesting to write too :)

Alex Gartrell
It's an interesting point. Most sorting algorithms only compare two things at a time, whereas people seem to be able to rank a small number of items quite quickly, relatively speaking. Maybe we're a little bit parallel ;) Incidentally, bucket sort and pigeon sort are pretty much the same thing.
Shane MacLaughlin
+1  A: 

Merge sort is definitely the way to go here, as you can use a Map/Reduce-type approach to have several humans doing the comparisons in parallel.

Quicksort is essentially a single threaded sort algorithm.

You could also tweak the merge sort algorithm so that instead of comparing two objects you present your human with a list of, say, five items and ask him or her to rank them.
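A rough sketch of that tweak, with a hypothetical rank_batch callback standing in for a human who puts a short list into best-to-worst order:

    def human_merge_sort(items, rank_batch, batch=5):
        """Merge sort built on rank_batch(list) -> the same items, best to worst."""
        # Phase 1: the human ranks small batches, giving us sorted runs cheaply.
        runs = [rank_batch(items[i:i + batch]) for i in range(0, len(items), batch)]
        # Phase 2: merge runs pairwise; each step asks the human to choose between
        # the two run fronts. Independent pairs could go to different people.
        while len(runs) > 1:
            next_runs = []
            for i in range(0, len(runs) - 1, 2):
                a, b, out = list(runs[i]), list(runs[i + 1]), []
                while a and b:
                    best = rank_batch([a[0], b[0]])[0]
                    out.append(a.pop(0) if best == a[0] else b.pop(0))
                next_runs.append(out + a + b)
            if len(runs) % 2:                    # odd run left over, carry it forward
                next_runs.append(runs[-1])
            runs = next_runs
        return runs[0] if runs else []

    # With an ordinary sort standing in for the human ranking:
    print(human_merge_sort([7, 3, 9, 1, 8, 2, 6], sorted, batch=3))   # [1, 2, 3, 6, 7, 8, 9]

Each pairwise merge within a pass is independent of the others, which is where the Map/Reduce-style parallelism comes from.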

Another possibility would be to use a ranking system as used by the famous "Hot or Not" web site. This requires many, many more comparisons, but the comparisons can happen in any sequence and in parallel, so it would work faster than a classic sort provided you have enough humans at your disposal.

James Anderson
Sure, m humans can start mergesorting n/m items each "right away", while for quicksort there is a "ramping up" period at the start -- you need log(m) partitioning steps before you have enough tasks for m people. But doesn't mergesort have the same problem at the *end* of the algorithm? The final merge step must be performed by a single person, right? Quicksort OTOH keeps everyone busy till the end.
j_random_hacker
+1  A: 

From an assignment I once did on this very subject ...

These comparison counts are for various sorting algorithms operating on data in random order:

Size      QkSort    HpSort   MrgSort     ModQk   InsrtSort
  2500     31388     48792     25105     27646     1554230
  5000     67818    107632     55216     65706     6082243
 10000    153838    235641    120394    141623    25430257
 20000    320535    510824    260995    300319   100361684
 40000    759202   1101835    561676    685937
 80000   1561245   2363171   1203335   1438017
160000   3295500   5045861   2567554   3047186

These comparison counts are for various sorting algorithms operating on data that starts out 'nearly sorted'. Amongst other things it shows the pathological case of quicksort:

Size      QkSort    HpSort   MrgSort     ModQk   InsrtSort
  2500     72029     46428     16001     70618      76050
  5000    181370    102934     34503    190391    3016042
 10000    383228    226223     74006    303128   12793735
 20000    940771    491648    158015    744557   50456526
 40000   2208720   1065689    336031   1634659  
 80000   4669465   2289350    712062   3820384
160000  11748287   4878598   1504127  10173850

From this we can see that merge sort is the best by number of comparisons.

I can't remember what the modifications to the quick sort algorithm were, but I believe it was something that used insertion sorts once the individual chunks got down to a certain size. This sort of thing is commonly done to optimise quicksort.
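That modification usually looks something like this; the cutoff value below is an arbitrary guess, not necessarily what the original assignment used:

    CUTOFF = 16   # below this size, hand the chunk to insertion sort

    def insertion_sort(a, lo, hi):
        """Sort a[lo..hi] in place."""
        for i in range(lo + 1, hi + 1):
            key, j = a[i], i - 1
            while j >= lo and a[j] > key:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = key

    def modified_quicksort(a, lo=0, hi=None):
        """Quicksort that falls back to insertion sort on small partitions."""
        if hi is None:
            hi = len(a) - 1
        if hi - lo + 1 <= CUTOFF:
            insertion_sort(a, lo, hi)
            return
        pivot = a[(lo + hi) // 2]       # middle element as pivot
        i, j = lo, hi
        while i <= j:                   # partition around the pivot
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i, j = i + 1, j - 1
        modified_quicksort(a, lo, j)    # recurse into both partitions
        modified_quicksort(a, i, hi)

    data = [38, 27, 43, 3, 9, 82, 10, 5, 71, 1, 64, 50, 23, 7, 99, 12, 31, 18]
    modified_quicksort(data)
    print(data)   # sorted ascending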

You might also want to look up Tadao Takaoka's 'Minimal Merge Sort', which is a more efficient version of the merge sort.

ConcernedOfTunbridgeWells
+1  A: 

The question raises more questions, really.

Are we talking about a single human performing the comparisons? It's a very different challenge if you are talking about a group of humans trying to arrange objects in order.

What about the questions of trust and error? Not everyone can be trusted to get everything right - certain sorts would go catastrophically wrong if at any given point you provided the wrong answer to a single comparison.

What about subjectivity? "Rank these pictures in order of cuteness". Once you get to this point, it could get really complex. As someone else mentions, something like "Hot or Not" is the simplest conceptually, but isn't very efficient. At its most complex, I'd say that Google is a way of sorting objects into an order, where the search engine is inferring the comparisons made by humans.

Cybergibbons
I assumed that a single human makes the comparisons. So I expect them to be consistent (as far as a human can be...). Of course they are subjective and maybe wrong sometimes. If many people do the (subjective) comparison, I would use something like the chess Elo rating, as mentioned in http://stackoverflow.com/questions/164831/how-to-rank-a-million-images-with-a-crowdsourced-sort
gyrolf
+3  A: 

To answer this, we need to make a lot of assumptions.

Let's assume we are sorting pictures by cuteness. The goal is to get the maximum usable information from the human in the least amount of time. This interaction will dominate all other computation, so it's the only one that counts.

As someone else mentioned, humans can deal well with ordering several items in one interaction. Let's say we can get eight items in relative order per round.

Each round introduces seven edges into a directed graph where the nodes are the pictures. If node A is reachable from node B, then node A is cuter than node B. Keep this graph in mind.

Now, let me tell you about a problem the Navy and the Air Force solve differently. They both want to get a group of people into height order, quickly. The Navy tells people to get in line, then if you're shorter than the guy in front of you, switch places, and repeat until done. In the worst case, that's N*N comparisons.

The Air Force tells people to stand in a square grid. They shuffle front-to-back among sqrt(N) people, which means a worst case of sqrt(N)*sqrt(N) == N comparisons. However, the people are then only sorted along one dimension. So the people face left and do the same shuffle again. Now we're up to 2N comparisons, and the sort is still imperfect, but it's good enough for government work. There's a short corner, a tall corner opposite, and a clear diagonal height gradient.
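A rough sketch of the Air Force grid in Python, with the built-in sort standing in for the front-to-back shuffle within each column and then each row; as described, the flattened result is only approximately ordered:

    import math

    def air_force_sort(people, key):
        """Two-pass grid sort: order each column, then each row, of a ~sqrt(N) grid."""
        side = max(1, math.isqrt(len(people)))
        rows = [people[i:i + side] for i in range(0, len(people), side)]
        for c in range(side):                         # pass 1: shuffle front-to-back
            col_rows = [r for r in rows if c < len(r)]
            for row, v in zip(col_rows, sorted((r[c] for r in col_rows), key=key)):
                row[c] = v
        for row in rows:                              # pass 2: face left, shuffle again
            row.sort(key=key)
        return rows                                   # short corner at rows[0][0]

    for row in air_force_sort([172, 185, 160, 178, 169, 181, 175, 190, 158], key=lambda h: h):
        print(row)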

You can see how the Air Force method gets results in less time if you don't care about perfection. You can also see how to reach perfection efficiently from there. You already know that the very shortest and very tallest people are in the two corners. The second-shortest might be behind or beside the shortest, the third-shortest might be behind or beside him. In general, someone's height rank is also his maximum possible Manhattan distance from the short corner.

Looking back at the graph analogy, the eight nodes to present each round are eight of those whose longest inbound path currently has the most common length. The length of the longest inbound path also represents the node's minimum possible sorted rank.
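A sketch of that bookkeeping, assuming the human's answers stay consistent (no preference cycles) and using a hypothetical rank_batch callback for the eight-item interaction:

    from collections import Counter, defaultdict

    def longest_inbound(nodes, better_than):
        """Longest path ending at each node; better_than[x] is the set of nodes
        already known to beat x. Assumes the recorded edges form a DAG."""
        memo = {}
        def depth(n):
            if n not in memo:
                memo[n] = 0   # placeholder also guards against accidental cycles
                memo[n] = 1 + max((depth(p) for p in better_than[n]), default=-1)
            return memo[n]
        return {n: depth(n) for n in nodes}

    def run_round(nodes, better_than, rank_batch, batch=8):
        """One human interaction: pick which items to show, record the ranking."""
        ranks = longest_inbound(nodes, better_than)
        common_len = Counter(ranks.values()).most_common(1)[0][0]
        chosen = [n for n in nodes if ranks[n] == common_len][:batch]
        ordered = rank_batch(chosen)             # human returns them cutest-first
        for better, worse in zip(ordered, ordered[1:]):
            better_than[worse].add(better)       # up to seven new edges per round

    # Toy driver: integers stand in for pictures, reverse sort stands in for the human.
    pictures = list(range(20))
    better_than = defaultdict(set)
    for _ in range(12):
        run_round(pictures, better_than, lambda b: sorted(b, reverse=True))
    print(longest_inbound(pictures, better_than))  # value = minimum possible rank so far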

You'll use a lot of CPU following this plan, but you will make the best possible use of your human resources.

Ian