It's a well-known issue with Quicksort that when the data set is already in, or almost in, sorted order, performance degrades horribly. In that case Insertion Sort, which is normally very slow, is easily the best choice. The question is knowing when to use which.

Is there an algorithm available to run through a data set, apply a comparison factor, and return a report on how close the data set is to being in sort order? I prefer Delphi/Pascal, but I can read other languages if the example isn't overly complex.

A: 

I've not heard of any pre-sorting analysis, but my opinion is that if you are going to go through the dataset to analyze it, you are already cutting into your overall sorting time.

martinatime
That's a good point, but if the analysis pass is O(n), it will not dominate the asymptotic sorting time. And if it can help avoid an O(n^2) worst-case sorting time, it could be a net benefit in overall sorting time for large datasets.
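For instance, such an O(n) pass could simply count adjacent out-of-order pairs. A minimal Delphi/Pascal sketch, assuming an integer array (the closer the count is to 0, the closer the data is to sorted):

    function DisorderCount(const A: array of Integer): Integer;
    var
      i: Integer;
    begin
      Result := 0;
      for i := Low(A) to High(A) - 1 do
        if A[i] > A[i + 1] then
          Inc(Result);  { one more adjacent pair out of order }
    end;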
ddaa
@ddaa: That would be true for comparison sorts, but O(n) sorting is possible with Radix Sort or Bucket Sort. If we include those algorithms, the sort time could be dominated by the analysis time...
Jason Punyon
@Jason: You wouldn't perform this analysis on data which you are about to bucket sort. The question is about choosing between quicksort and insertion sort, and you're planning to do neither...
Steve Jessop
@Steve Jessop: +1 Good point.
Jason Punyon
A: 

One possible solution is to take the first, last and middle elements in the current sort range (during the QuickSort operation) and choose the median of the three as the pivot element.
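A minimal Pascal sketch of that pivot choice, assuming integer items (returns the index of whichever of the three candidates holds the median value):

    function MedianOfThreeIndex(const A: array of Integer; Lo, Hi: Integer): Integer;
    var
      Mid: Integer;
    begin
      Mid := Lo + (Hi - Lo) div 2;
      if A[Lo] <= A[Mid] then
      begin
        if A[Mid] <= A[Hi] then
          Result := Mid             { A[Lo] <= A[Mid] <= A[Hi] }
        else if A[Lo] <= A[Hi] then
          Result := Hi              { A[Lo] <= A[Hi] < A[Mid] }
        else
          Result := Lo              { A[Hi] < A[Lo] <= A[Mid] }
      end
      else if A[Lo] <= A[Hi] then
        Result := Lo                { A[Mid] < A[Lo] <= A[Hi] }
      else if A[Mid] <= A[Hi] then
        Result := Hi                { A[Mid] <= A[Hi] < A[Lo] }
      else
        Result := Mid;              { A[Hi] < A[Mid] < A[Lo] }
    end;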

gabr
Your best case is still O(N log N), whereas Insertion Sort is O(N) for nearly sorted data.
wowest
A: 

To analyze the data fully enough to decide which algorithm to use, you are going to do nearly the work of sorting it. You could instead check the values at a small percentage of random but increasing indexes, i.e. analyze a small sample of the items, as in the sketch below.
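For illustration, a rough Pascal sketch of that idea (hypothetical; assumes an integer array, and that the caller has already called Randomize):

    function SampleLooksSorted(const A: array of Integer; MaxStride: Integer): Boolean;
    var
      Idx, Prev: Integer;
    begin
      Result := True;
      if Length(A) < 2 then Exit;
      if MaxStride < 1 then MaxStride := 1;
      Idx := 0;
      Prev := A[0];
      while Idx <= High(A) do
      begin
        if A[Idx] < Prev then
        begin
          Result := False;  { sampled values are out of order }
          Exit;
        end;
        Prev := A[Idx];
        Idx := Idx + 1 + Random(MaxStride);  { random but increasing indexes }
      end;
    end;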

BioBuckyBall
+3  A: 

There's also SmoothSort, which is apparently quite tricky to implement, but it varies between O(N log N) and O(N) depending on how sorted the data is to start with.

http://en.wikipedia.org/wiki/Smoothsort

Long tricky PDF: http://www.cs.utexas.edu/users/EWD/ewd07xx/EWD796a.PDF

However, if your data is truly huge and you have to access it serially, mergesort is probably the best. It's always O(N log N) and it has excellent 'locality' properties.

wowest
A: 

You would still have to run through all the records to determine whether they're sorted, so to improve performance, start with your first record and run through the rest until you either notice something not properly sorted or reach the end of the list. If you find a mismatch, then sort only the items from that position to the end (since the beginning of the list is already sorted).

For each item in the second part, see if the item is less than the last element of the first part; if so, insert it with an insertion sort into ONLY the first part. Otherwise quicksort all the other items in the second part. This way the sort is optimized for the specific case.
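A small Pascal sketch of the first step, finding where the sorted prefix ends (assuming integer items):

    { Returns the index of the first out-of-order element,
      or -1 if the whole array is already sorted. }
    function FirstUnsortedIndex(const A: array of Integer): Integer;
    var
      i: Integer;
    begin
      Result := -1;
      for i := Low(A) + 1 to High(A) do
        if A[i] < A[i - 1] then
        begin
          Result := i;
          Exit;
        end;
    end;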

skamradt
A: 

Since QuickSort is a problem only when the data set is huge and already mostly sorted, I would use the following heuristics (pending a full-blown solution):

  • Don't bother if the data set size is below a threshold.

  • If you have quick (indexed) access to the records (items), take a sample of 1 record in every N and see if they are already sorted. That should be quick enough for a small sample, and you can then decide whether or not to use quicksort (see the sketch below).
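A rough Pascal sketch of that every-Nth-record check (hypothetical; assumes indexed access to an integer array):

    function EveryNthLooksSorted(const A: array of Integer; N: Integer): Boolean;
    var
      i: Integer;
    begin
      Result := True;
      if N < 1 then N := 1;
      i := Low(A) + N;
      while i <= High(A) do
      begin
        if A[i] < A[i - N] then
        begin
          Result := False;  { sampled records are out of order }
          Exit;
        end;
        Inc(i, N);
      end;
    end;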

François
But the sample fails if every Nth record is in order while the records between them aren't. You may still have to read every record to see whether ONE of the unsampled ones is out of order.
skamradt
Agreed, but there is statistically very little chance that the sample would deviate so much from the overall population, especially if you randomize N a little.
François
+7  A: 

As you'd expect, quite a lot of thought goes into this. The median-of-three technique means that quicksort's worst-case behaviour doesn't occur for sorted data, but instead for less obvious cases.

Introsort is quite exciting, since it avoids quicksort's quadratic worst case altogether. Instead of your natural question, "how do I detect that the data is nearly-sorted", it in effect asks itself as it's going along, "is this taking too long?". If the answer is yes, it switches from quicksort to heapsort.
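In outline it looks something like this (a rough Pascal sketch; Partition, HeapSortRange and InsertionSortRange are assumed helper routines, not shown here):

    procedure IntroSort(var A: array of Integer; Lo, Hi, DepthLimit: Integer);
    var
      P: Integer;
    begin
      if Hi - Lo < 16 then
        InsertionSortRange(A, Lo, Hi)   { tiny range: insertion sort wins }
      else if DepthLimit = 0 then
        HeapSortRange(A, Lo, Hi)        { "taking too long": abandon quicksort }
      else
      begin
        P := Partition(A, Lo, Hi);      { ordinary quicksort partition step }
        IntroSort(A, Lo, P - 1, DepthLimit - 1);
        IntroSort(A, P + 1, Hi, DepthLimit - 1);
      end;
    end;

The initial depth limit is typically around 2 * log2(n), so the heapsort fallback only triggers when the recursion goes much deeper than a balanced quicksort would.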

Timsort (http://en.wikipedia.org/wiki/Timsort) combines merge sort with insertion sort, and performs very well on sorted or reverse-sorted data, and on data that includes sorted or reverse-sorted subsets.

So probably the answer to your question is, "you don't need a pre-pass analysis, you need an adaptive sort algorithm".

Steve Jessop
+1 for timsort link
Peter Recore
+1 wow, timsort does look quite neat.
wowest
A: 

To make a conceptual point that people haven't yet made: Quicksort is a common-sense divide-and-conquer algorithm with an obvious bug in rare cases. Suppose that you want to sort a stack of student papers. (Which I have to do with some regularity.) In the quicksort algorithm, you pick some paper, the pivot. Then divide the other papers according to whether they are before or after the pivot. Then repeat that with the two subpiles. What's the bug? The pivot could be a name that is near one end of the list instead of in the middle, so that it doesn't accomplish much to divide it into two piles.

Merge sort is another divide-and-conquer algorithm that works in a different order. You can merge two sorted lists in linear time. Divide the papers into two equal or nearly equal piles, then recursively sort each one, then merge. Merge sort doesn't have any bugs. One reason that quicksort is more popular than merge sort is historical: Quicksort is fast (usually) and it works without any extra memory. But these days, it can be more important to save comparisons than to save memory, and the actual rearrangement is often abstracted by permuting pointers. If things had always been that way, then I suspect that merge sort would simply have been more popular than quicksort. (And maybe adding "quick" to the name was good salesmanship.)
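For example, the linear-time merge step might look like this in Pascal (a sketch, assuming two already-sorted integer arrays):

    type
      TIntArray = array of Integer;

    function MergeSorted(const L, R: array of Integer): TIntArray;
    var
      i, j, k: Integer;
    begin
      SetLength(Result, Length(L) + Length(R));
      i := 0; j := 0; k := 0;
      { repeatedly take the smaller front element of the two lists }
      while (i < Length(L)) and (j < Length(R)) do
      begin
        if L[i] <= R[j] then
          begin Result[k] := L[i]; Inc(i); end
        else
          begin Result[k] := R[j]; Inc(j); end;
        Inc(k);
      end;
      { copy whatever remains of the list that wasn't exhausted }
      while i < Length(L) do
        begin Result[k] := L[i]; Inc(i); Inc(k); end;
      while j < Length(R) do
        begin Result[k] := R[j]; Inc(j); Inc(k); end;
    end;

Each element is copied exactly once, which is what makes the merge linear.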

Greg Kuperberg