Is it possible to calculate median of a list of numbers better than O(n log n)?

views:

936

answers:

+4 Q:

Is it possible to calculate median of a list of numbers better than O(n log n)?

I know that it is possible to calculate the mean of a list of numbers in O(n). But what about the median? Is there any better algorithm than sort (O(n log n)) and lookup middle element (or mean of two middle elements if an even number of items in list)?

+1 A:

This link has popped up recently on calculating median: http://matpalm.com/median/question.html .

In general I think you can't go beyond O(n log n) time, but I don't have any proof on that :). No matter how much you make it parallel, aggregating the results into a single value takes at least log n levels of execution.

Zed 2009-08-21 13:01:31

I changed your answer from "O(log n)" to "O(n log n)", which is what you were looking for, I think, given the question and the rest of your answer.

Beska 2009-08-21 13:19:07

+12 A:

Yes. You can do it (deterministically) in O(n).

Tyler McHenry 2009-08-21 13:02:05

That link talks about a "median of medians", or in other words, an approximation of the "true" median. I'm not sure that's what the OP is asking for.

Chris Jester-Young 2009-08-21 13:15:39

Using deterministic selection you get the real median.See here: http://en.wikipedia.org/wiki/Selection_algorithm

Anna 2009-08-21 13:29:04

I see that Pesto wrote this already.

Anna 2009-08-21 13:36:38

@Chris Jester-Young: It does talk about a "median of medians", but only as an intermediate value in the algorithm - not the result! This algorithm does find the median (not median of medians) in O(N), worst case, time.

yairchu 2009-08-21 16:43:55

+1. But note that it needs 24n comparisons, meaning that it is likely to be much slower than the randomised method, which averages 1.5n comparisons. (Numbers taken or inferred from the last two paras of the linked page.)

j_random_hacker 2009-08-22 17:49:15

+4 A:

If the numbers are discrete (e.g. integers) and there is a manageable number of distinct values, you can use a "bucket sort" which is O(N), then iterate over the buckets to figure out which bucket holds the median. The complete calculation is O(N) in time and O(B) in space.

Stephen C 2009-08-21 13:06:46

+9 A:

What you're talking about is a selection algorithm, where k = n/2. There is a method based on the same partitioning function used in quicksort which works. It is called, not surprisingly, quickselect. While it can, like quicksort, have a O(n²) worst case, this can be brought down to linear time using the proper pivot selection.

Pesto 2009-08-21 13:16:37

+3 A:

Partially irrelevant, but: a quick tip on how to quickly find answers to common basic questions like this on the web.

We're talking about medians? So Gg to the page about medians in wikipedia
Search page for algorithm:

Efficient computation of the sample median

Even though sorting n items takes in general O(n log n) operations, by using a "divide and conquer" algorithm the median of n items can be computed with only O(n) operations (in fact, you can always find the k-th element of a list of values with this method; this is called the selection problem).

Follow the link to the selection problem for the description of algorithm. Read intro:

... There are worst-case linear time selection algorithms. ...

And if you're interested read about the actual ingenious algorithm.

yairchu 2009-08-21 16:52:43

+2 A:

Just for fun (and who knows, it may be faster) there's another randomized median algorithm, explained technically in Mitzenmacher's and Upfall's book. Basically, you choose a polynomially-smaller subset of the list, and (with some fancy bookwork) such that it probably contains the real median, and then use it to find the real median. The book is on google books, and here's a link. Note: I was able to read the pages of the algorthm, so assuming that google books reveals the same pages to everyone, you can read them too.

It is a randomized algorithm s.t. if it finds the answer, it is 100% certain that it is the correct answer (this is called Las Vegas style). The randomness arises from the runtime --- occasionally (with probability 1/(sqrt(n)), I think) it FAILS to find the median, and must be re-run.

Asymptotically, it is exactly linear when you take into the chance of failure --- that is to say, it is a wee bit less than linear, exactly such that when you take into account the number of times you may need to re-run it, it becomes linear.

Note: I'm not saying this is better or worse --- I certainly haven't done a real-life runtime comparison between these algorithms! I'm simply presenting an additional algorithm that has linear runtime, but works in a significantly different way.

Agor 2009-08-21 17:00:46

Interesting stuff!

j_random_hacker 2009-08-22 17:50:18

ansaurus

tags:

views:

answers:

Is it possible to calculate median of a list of numbers better than O(n log n)?

related questions