views:

6416

answers:

7

When implementing Quicksort, one of the things you have to do is choose a pivot. But when I look at pseudocode like the one below, it is not clear how I should choose the pivot. First element of the list? Something else?

 function quicksort(array)
     var list less, greater
     if length(array) ≤ 1  
         return array  
     select and remove a pivot value pivot from array
     for each x in array
         if x ≤ pivot then append x to less
         else append x to greater
     return concatenate(quicksort(less), pivot, quicksort(greater))

Can someone help me grasp the concept of choosing a pivot, and whether or not different scenarios call for different strategies?

Thanks

+14  A: 

Choosing a random pivot minimizes the chance that you will encounter worst-case O(n^2) performance (always choosing the first or last element would cause worst-case performance for nearly-sorted or nearly-reverse-sorted data). Choosing the middle element would also be acceptable in the majority of cases.

Also, if you are implementing this yourself, there are versions of the algorithm that work in-place (i.e. without creating two new lists and then concatenating them).
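As a sketch of such an in-place version (this is my own illustration, not code from the answer: it uses a random pivot with the Lomuto partition scheme, and the function name is made up):

```python
import random

def quicksort_inplace(a, lo=0, hi=None):
    """Sort list a in place, using a randomly chosen pivot (Lomuto partition)."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    # Pick a random pivot and swap it to the end of the range.
    p = random.randint(lo, hi)
    a[p], a[hi] = a[hi], a[p]
    pivot = a[hi]
    # Partition: everything <= pivot moves to the front of the range.
    i = lo
    for j in range(lo, hi):
        if a[j] <= pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    # Put the pivot between the two partitions.
    a[i], a[hi] = a[hi], a[i]
    quicksort_inplace(a, lo, i - 1)
    quicksort_inplace(a, i + 1, hi)
```

No lists are allocated or concatenated; the recursion works on index ranges of the original array.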

Kip
I would second the notion that implementing a sort yourself might not be worth the effort. Also, be careful how you're picking random numbers, since random number generators are kinda slow sometimes.
PeterAllenWebb
+5  A: 

It depends on your requirements. Choosing a pivot at random makes it harder to create a data set that generates O(N^2) performance. Median-of-three (first, last, middle) is also a way of avoiding problems. Beware of the relative cost of comparisons, though; if your comparisons are costly, then Mo3 does more comparisons than choosing a pivot at random. Database records can be costly to compare.
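A sketch of the deterministic median-of-three pick described here (Python and the function name are my own illustration, assuming the pivot is chosen from the first, middle, and last elements of the range):

```python
def median_of_three(a, lo, hi):
    """Return the index of the median of a[lo], a[mid], a[hi]."""
    mid = (lo + hi) // 2
    # Sort the three (value, index) pairs and take the middle one.
    # This costs two or three comparisons per pivot selection.
    return sorted([(a[lo], lo), (a[mid], mid), (a[hi], hi)])[1][1]
```

The extra comparisons per pivot are the cost Mo3 pays versus a single random pick, which is why the answer warns about expensive comparison functions.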

Jonathan Leffler
Median of 3 is NOT first, last, middle. Choose three random indexes, and take the middle value of those. The whole point is to make sure that your choice of pivots is not deterministic -- if it is, worst-case data can be quite easily generated.
mdkess
@mdkess: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.39.1103 supports your contention: http://www.people.carleton.edu/~grossm/info.html supports mine. I'd not heard of 3 random elements before, but it makes sense.
Jonathan Leffler
@mdkess: There's an article http://portal.acm.org/citation.cfm?id=1176.1243 I'd like to see what it is talking about, but the text isn't available AFAICT.
Jonathan Leffler
@mdkess: http://www.lsi.upc.edu/~conrado/research/papers/rsa-kpm97.pdf discusses a random 3 version. Google 'median-of-three' works pretty well for further tracking. Thanks for the information; I've only encountered the deterministic median-of-three before.
Jonathan Leffler
+1  A: 

If you are sorting a random-accessible collection (like an array), it's generally best to pick the physical middle item. With this, if the array is already sorted (or nearly sorted), the two partitions will be close to even, and you'll get the best speed.

If you are sorting something with only linear access (like a linked list), then it's best to choose the first item, because it's the fastest item to access. Here, however, if the list is already sorted, you're screwed -- one partition will always be empty, and the other will have everything, producing the worst-case time.

However, for a linked list, picking anything besides the first item just makes matters worse. To pick the middle item in a linked list, you'd have to step through it on each partition step -- adding an O(N/2) operation which is done log N times, making the total time O(1.5 N log N). And that's if we know how long the list is before we start -- usually we don't, so we'd have to step all the way through to count the elements, then step halfway through to find the middle, then step through a third time to do the actual partition: O(2.5 N log N).
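A sketch of the first-item-pivot approach on a singly linked list (the `Node` class and function name are my own illustration; the head is the pivot because it is the only element reachable in O(1)):

```python
class Node:
    def __init__(self, value, nxt=None):
        self.value = value
        self.next = nxt

def quicksort_list(head):
    """Sort a singly linked list, using the head as the pivot."""
    if head is None or head.next is None:
        return head
    pivot = head
    less = greater = None
    node = head.next
    # Partition the rest of the list by relinking nodes (no copying).
    while node is not None:
        nxt = node.next
        if node.value <= pivot.value:
            node.next, less = less, node
        else:
            node.next, greater = greater, node
        node = nxt
    less = quicksort_list(less)
    greater = quicksort_list(greater)
    # Stitch together: less ++ [pivot] ++ greater.
    pivot.next = greater
    if less is None:
        return pivot
    tail = less
    while tail.next is not None:
        tail = tail.next
    tail.next = pivot
    return less
```

The stitch-up walk to the tail of `less` adds O(n) work per level; a production version would track the tail during recursion, but the point stands that only the head is cheap to reach.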

James Curran
+1  A: 

It is entirely dependent on how your data is sorted to begin with. If you think it will be pseudo-random then your best bet is to either pick a random selection or choose the middle.

Joe Philllips
+3  A: 

Heh, I just taught this class.

There are several options.
Simple: pick the first or last element of the range. (Bad on partially sorted input.)
Better: pick the item in the middle of the range. (Better on partially sorted input.)

However, picking any arbitrary element runs the risk of poorly partitioning the array of size n into two arrays of size 1 and n-1. If you do that often enough, your quicksort runs the risk of becoming O(n^2).

One improvement I've seen is to pick median(first, last, mid); in the worst case it can still go to O(n^2), but probabilistically this is a rare case.

For most data, picking the first or last is sufficient. But if you find that you're running into worst-case scenarios often (partially sorted input), the first option would be to pick the central value (which is a statistically good pivot for partially sorted data).

If you're still running into problems, then go the median route.

chris
A: 

Never ever choose a fixed pivot - this can be attacked to exploit your algorithm's worst case O(n^2) runtime, which is just asking for trouble. Quicksort's worst case runtime occurs when partitioning results in one array of 1 element, and one array of n-1 elements. Suppose you choose the first element as your partition. If someone feeds an array to your algorithm that is in decreasing order, your first pivot will be the biggest, so everything else in the array will move to the left of it. Then when you recurse, the first element will be the biggest again, so once more you put everything to the left of it, and so on.

A better technique is the median-of-3 method, where you pick three elements at random, and choose the middle. You know that the element you choose won't be the first or the last, but also, by the central limit theorem, the distribution of the middle element will be normal, which means that you will tend towards the middle (and hence, n lg n time).
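A sketch of this randomized median-of-3 pick (Python and the function name are my own illustration; note the three indexes are drawn independently, so they may occasionally coincide):

```python
import random

def random_median_of_three(a, lo, hi):
    """Pick three random indexes in [lo, hi] and return the index whose
    value is the median of the three, so an adversary cannot predict
    (and therefore cannot target) the pivot choice."""
    i, j, k = (random.randint(lo, hi) for _ in range(3))
    # Order the three (value, index) pairs and take the middle one.
    return sorted([(a[i], i), (a[j], j), (a[k], k)])[1][1]
```

The returned index would then be swapped into position and partitioned on, exactly as with any other pivot choice.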

If you absolutely want to guarantee O(n lg n) runtime for the algorithm, the median-of-medians (columns-of-5) method for finding the median of an array runs in O(n) time, which means that the recurrence equation for quicksort in the worst case will be T(n) = O(n) (find the median) + O(n) (partition) + 2T(n/2) (recurse left and right). By the Master Theorem, this is O(n lg n). However, the constant factor will be huge, and if worst-case performance is your primary concern, use a merge sort instead, which is only a little bit slower than quicksort on average, and guarantees O(n lg n) time (and will be much faster than this lame median quicksort).

mdkess
+2  A: 
paperhorse