Finding median of large set of numbers too big to fit into memory

views:

203

answers:

+3 Q:

Finding median of large set of numbers too big to fit into memory

I was asked this question in an interview recently.

There are N numbers, too many to fit into memory. They are split across k database tables (unsorted), each of which can fit into memory. Find the median of all the numbers.

Wasn't quite sure about the answer to this one.

+2 A:

Have a look at the "Median of Medians" algorithm in this wikipedia article.

Related question: Median-of-medians in Java.

Explanation: http://www.ics.uci.edu/~eppstein/161/960130.html

aioobe 2010-10-08 06:02:51

Does this require all the sublists to be the same size? What if they are not?

garsh0p 2010-10-08 06:06:57

Good question. Probably. You could of course also process each table 5 (or 7, ..) rows at a time... But then I suppose you may be violating the memory constraints.

aioobe 2010-10-08 06:13:35

For the median of medians you still need to do pivoting. It is an in memory approach. In this case you need to use a streaming approach. In the first pass you find max and min of the sample. Call that U and L for upper an lower bounds. Then set a candidate median at the average of U and L. With a second pass you count the probability of X < candidate. If it's .5 you are done if not, if it's lower you replace L with the candidate, and the candidate with average of candidate and upper. In the other case, same but swap the roles of upper and lower. Essentially it's binary search among possible medians. If you can afford additional disk space, you can at each step filter all the elements between U and L and work only on those going forward, since the median is guaranteed to be included. This brings the complexity down to linear time, just as that of pivoting.

piccolbo 2010-10-08 17:25:00

ansaurus

tags:

views:

answers:

Finding median of large set of numbers too big to fit into memory

related questions