I know that it is possible to calculate the mean of a list of numbers in O(n). But what about the median? Is there any better algorithm than sort (O(n log n)) and lookup middle element (or mean of two middle elements if an even number of items in list)?
views:
936answers:
6This link has popped up recently on calculating median: http://matpalm.com/median/question.html .
In general I think you can't go beyond O(n log n) time, but I don't have any proof on that :). No matter how much you make it parallel, aggregating the results into a single value takes at least log n levels of execution.
If the numbers are discrete (e.g. integers) and there is a manageable number of distinct values, you can use a "bucket sort" which is O(N), then iterate over the buckets to figure out which bucket holds the median. The complete calculation is O(N) in time and O(B) in space.
What you're talking about is a selection algorithm, where k = n/2
. There is a method based on the same partitioning function used in quicksort which works. It is called, not surprisingly, quickselect. While it can, like quicksort, have a O(n2) worst case, this can be brought down to linear time using the proper pivot selection.
Partially irrelevant, but: a quick tip on how to quickly find answers to common basic questions like this on the web.
- We're talking about medians? So Gg to the page about medians in wikipedia
- Search page for algorithm:
Efficient computation of the sample median
Even though sorting n items takes in general O(n log n) operations, by using a "divide and conquer" algorithm the median of n items can be computed with only O(n) operations (in fact, you can always find the k-th element of a list of values with this method; this is called the selection problem).
- Follow the link to the selection problem for the description of algorithm. Read intro:
... There are worst-case linear time selection algorithms. ...
- And if you're interested read about the actual ingenious algorithm.
Just for fun (and who knows, it may be faster) there's another randomized median algorithm, explained technically in Mitzenmacher's and Upfall's book. Basically, you choose a polynomially-smaller subset of the list, and (with some fancy bookwork) such that it probably contains the real median, and then use it to find the real median. The book is on google books, and here's a link. Note: I was able to read the pages of the algorthm, so assuming that google books reveals the same pages to everyone, you can read them too.
It is a randomized algorithm s.t. if it finds the answer, it is 100% certain that it is the correct answer (this is called Las Vegas style). The randomness arises from the runtime --- occasionally (with probability 1/(sqrt(n)), I think) it FAILS to find the median, and must be re-run.
Asymptotically, it is exactly linear when you take into the chance of failure --- that is to say, it is a wee bit less than linear, exactly such that when you take into account the number of times you may need to re-run it, it becomes linear.
Note: I'm not saying this is better or worse --- I certainly haven't done a real-life runtime comparison between these algorithms! I'm simply presenting an additional algorithm that has linear runtime, but works in a significantly different way.