A: 

Split the 10^9 numbers, 10^7 to each computer ~ 80MB on each. Each computer sorts its numbers. Then computer 1 merge-sorts its own numbers with those from computer 2, computer 3 with computer 4, etc. Then computer 1 writes half of the merged numbers back to computer 2, computer 3 half back to computer 4, etc. Then computer 1 merge-sorts the numbers from computers 1, 2, 3 and 4 and writes them back. And so on. Depending on the size of RAM on the computers you may get away with not writing all the numbers back to the individual computers at each step; you might be able to accumulate the numbers on computer 1 for several steps, but you do the maths.

Oh, finally get the mean of the 500000000th and 500000001st values (but check there are enough 00s in there, I haven't).
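
For concreteness, here's a minimal single-process sketch of the pairwise merge rounds (the "computers" are just Python lists, and the write-half-back traffic optimisation is omitted since everything lives in one process):

    # Minimal single-process sketch of the pairwise merge rounds described above.
    # The "computers" are just lists; in a real cluster each round would ship one
    # machine's sorted chunk across the network (and write half of it back).
    import heapq
    import random

    def distributed_median(numbers, machines=100):
        # Each computer sorts its own 1/100th of the data.
        chunks = [sorted(numbers[i::machines]) for i in range(machines)]

        # Merge computer 1 with 2, 3 with 4, ... until one sorted sequence remains.
        while len(chunks) > 1:
            merged = [list(heapq.merge(a, b)) for a, b in zip(chunks[0::2], chunks[1::2])]
            if len(chunks) % 2:
                merged.append(chunks[-1])      # odd chunk carries over unmerged
            chunks = merged

        # Mean of the two middle values (an even count is assumed, as in the answer).
        full = chunks[0]
        mid = len(full) // 2
        return (full[mid - 1] + full[mid]) / 2

    data = [random.randint(0, 10**6) for _ in range(10**5)]   # far smaller than 10^9, just a demo
    print(distributed_median(data))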

EDIT: @Roman -- well if you can't believe it even if it's true then there's no point in my revealing the truth or falsehood of the proposition. What I meant to state was that brute force sometimes beats smart in a race. It took me about 15 seconds to devise an algorithm which I am confident that I can implement, which will work, and which will be adaptable to a wide range of sizes of inputs and numbers of computers, and tunable to the characteristics of the computers and networking arrangements. If it takes you, or anyone else, say 15 minutes to devise a more sophisticated algorithm, I have a 14m45s advantage to code up my solution and start it running.

But I freely admit this is all assertion, I haven't measured anything.

High Performance Mark
Here we are just merge-sorting all the numbers. Can we do it in a better way using the fact that "we can find the median of two sorted lists in log n time, where n is the length of each list"?
anony
@anony -- while you answer your own question, I'll have my solution coded up, tested and done. I expect that there are better ways, but sometimes parallelising a simple way leaves me free to scratch my head on the really difficult problems.
High Performance Mark
@High Performance Mark: have you really done it in **7** minutes? I can't believe that even if it's true. I did a similar task (it was a university assignment) and it took about 2 hours to implement and test all the remoting stuff (I used Java RMI).
Roman
@High Performance Mark: I see what you're saying, but by the same token DrPizza has an even quicker-to-think-of solution, which is to sort all the data on a single node and ignore the other 99. None of us knows how expensive data transfer should be considered, so we're all just picking a compromise that sounds vaguely plausible. Your solution transfers all the data multiple times, so I'm a bit suspicious of it, but it's certainly a solution.
Steve Jessop
'vaguely plausible' -- that's good enough for me @Steve ! Especially in response to a vaguely implausible question.
High Performance Mark
A: 

One computer is more than enough to solve the problem.

But let's assume that there are 100 computers. The only complex thing you have to do is sort the list. Split it into 100 parts, send one part to each computer, let them be sorted there, and merge the parts after that.

Then take the number from the middle of the sorted list (i.e. with index 500 000 000).
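
A tiny single-process stand-in for this scheme (the hundred "parts" are just list slices; for an even count, the two middle values are averaged as in the other answers):

    # Split into 100 parts, sort each, merge, and take the middle of the combined order.
    import heapq
    import random

    data = [random.random() for _ in range(10**6)]
    parts = [sorted(data[i::100]) for i in range(100)]     # one part per "computer"
    merged = list(heapq.merge(*parts))
    n = len(merged)
    print((merged[n // 2 - 1] + merged[n // 2]) / 2)       # mean of the two middle values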

Roman
Why is it downvoted?
Roman
Anyway now my rep is pretty round :)
Roman
-1 because sorting is not necessary for finding a median.
Pavel Shved
Merging is at best O(n), and you can find the median on a single core in O(n), so this seems to create a lot of extra work for no gain.
Rex Kerr
+7  A: 

I'd use "median of medians" algorithm. The idea is that, for any x > 1, if we can split array into groups of 2x+1 elements and find a median of each group, and find the median of these medians, then this element partitions the array (as in quicksort) in such a way that each partition will be at least 25% long. The algorithm, recursively applied, yields the median in linear time.

Here's a way to parallelize. As in the original algorithm, choose x equal to two (or to another relatively small number). This gives us a pool of small tasks, each of which can be computed on a separate node. However, partitioning should be done at the master node.

The upside is that it doesn't require much memory on each computer. The downside is a greater load on the network.

However, the algorithm may be greatly improved if you parallelize partitioning as well (honestly, partitioning is somewhat more important than selection of the pivot). Some ideas are proposed by Steve Jessop in the comments to this answer.
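
For reference, here is a compact, non-parallel sketch of that selection with groups of five (i.e. x = 2); the partition step in the middle is the part proposed for the master node:

    # Median-of-medians selection (groups of 5). Worst-case linear time; the
    # partitioning around the pivot is the step discussed above.
    import random

    def select(arr, k):
        """Return the k-th smallest element of arr (0-based)."""
        if len(arr) <= 5:
            return sorted(arr)[k]

        # Median of each group of five, then the median of those medians as the pivot.
        groups = [arr[i:i + 5] for i in range(0, len(arr), 5)]
        medians = [sorted(g)[len(g) // 2] for g in groups]
        pivot = select(medians, len(medians) // 2)

        # Partition around the pivot (done at the master node in the parallel version).
        lows = [x for x in arr if x < pivot]
        pivots = [x for x in arr if x == pivot]
        highs = [x for x in arr if x > pivot]

        if k < len(lows):
            return select(lows, k)
        if k < len(lows) + len(pivots):
            return pivot
        return select(highs, k - len(lows) - len(pivots))

    data = [random.randint(0, 10**6) for _ in range(10**5)]
    n = len(data)
    print((select(data, n // 2 - 1) + select(data, n // 2)) / 2)   # median of an even count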

Pavel Shved
I think you have overstated the power of the "median of medians". For example, consider the numbers 1 - 9, split into 3 parts {1,2,5}, {3,4,6}, {7,8,9}. The median of the part-wise medians is 4, but the true median is 5.
Steve Jessop
@Steve, that was a great FAIL, indeed. But now I fixed my solution.
Pavel Shved
would you explain this solution more clearly please - how will you recursively apply the algo
anony
@anony, scroll the wiki page up a bit, and read the section "Partition-based general selection algorithm"
Pavel Shved
"partitioning should be done at master node" - but partitioning is all that quickselect does. I think that to parallelize this, you need to do some partitioning on other nodes than the master. I'm not quite sure how, but for example one node could send "big" values one way and "small" values another, to two nodes which could each start partitioning "their" sections. By the time the master partition is complete, each of the others is nearly complete, and then the master "cancels" the useless one. As long as there are nodes free, the sub-selects could have been likewise passing on data.
Steve Jessop
So you'd need a decent scheduler, to tell each node where it should be sending each of the two parts of its output, and also to cancel useless jobs and reassign that node to take the output of a non-useless job. Data cascading all over the place.
Steve Jessop
"partitioning is somewhat more important than seleciton of the pivot" - in particular, you describe selecting the pivot in terms of calculating medians, *which is where we started*. Chopping an array up into chunks and calculating the median of each chunk takes a long time, since quickselect is only O(N) expected time to start with. The median-of-medians doesn't simplify the problem, it just improves the worst case. Hence the need to parallelize the partitioning.
Steve Jessop
@Steve, after thinking, I guess you're right, and my solution is crap. I wish I could downvote self :-)
Pavel Shved
+2  A: 
sort -g numbers | head -n 500000001 | tail -n 2 | dc -e "1 k ? ? + 2 / p"
DrPizza
LOL. Does that really work or will the OOM killer nuke it before it completes? (on any reasonable computer)
Isak Savo
Should do. sort knows how to do an out-of-core sort, so it won't run out of memory.
DrPizza
+6  A: 

Ah, my brain has just kicked into gear, I have a sensible suggestion now. Probably too late if this had been an interview, but never mind:

Machine 1 shall be called the "control machine", and for the sake of argument either it starts with all the data, and sends it in equal parcels to the other 99 machines, or else the data starts evenly distributed between the machines, and it sends 1/99 of its data to each of the others. The partitions do not have to be equal, just close.

Each other machine sorts its data, and does so in a way which favours finding the lower values first. So for example a quicksort, always sorting the lower part of the partition first[*]. It writes its data back to the control machine in increasing order as soon as it can (using asynchronous IO so as to continue sorting, and probably with Nagle on: experiment a bit).

The control machine performs a 99-way merge on the data as it arrives, but discards the merged data, just keeping count of the number of values it has seen. It calculates the median as the mean of the 500,000,000th and 500,000,001st values.

This suffers from the "slowest in the herd" problem. The algorithm cannot complete until every value less than the median has been sent by a sorting machine. There's a reasonable chance that one such value will be quite high within its parcel of data. So once the initial partitioning of the data is complete, estimated running time is the combination of the time to sort 1/99th of the data and send it back to the control computer, and the time for the control to read 1/2 the data. The "combination" is somewhere between the maximum and the sum of those times, probably close to the max.

My instinct is that for sending data over a network to be faster than sorting it (let alone just selecting the median) it needs to be a pretty damn fast network. Might be a better prospect if the network can be presumed to be instantaneous, for example if you have 100 cores with equal access to RAM containing the data.

Since network I/O is likely to be the bound, there might be some tricks you can play, at least for the data coming back to the control machine. For example, instead of sending "1,2,3,.. 100", perhaps a sorting machine could send a message meaning "100 values less than 101". The control machine could then perform a modified merge, in which it finds the least of all those top-of-a-range values, then tells all the sorting machines what it was, so that they can (a) tell the control machine how many values to "count" below that value, and (b) resume sending their sorted data from that point.

More generally, there's probably a clever challenge-response guessing game that the control machine can play with the 99 sorting machines.

This involves round-trips between the machines, though, which my simpler first version avoids. I don't really know how to blind-estimate their relative performance, and since the trade-offs are complex, I imagine there are much better solutions out there than anything I'll think of myself, assuming this is ever a real problem.

[*] available stack permitting - your choice of which part to do first is constrained if you don't have O(N) extra space. But if you do have enough extra space, you can take your pick, and if you don't have enough space you can at least use what you do have to cut some corners, by doing the small part first for the first few partitions.
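
To make the basic scheme concrete, here's a rough single-process sketch of the counting merge (no networking or asynchronous IO; heapq.merge stands in for the 99-way merge, and the challenge-response refinements are left out):

    # Each "machine" contributes a sorted parcel, and the control machine merges
    # lazily, discarding values and only counting until it reaches the two middle
    # positions.
    import heapq
    import random

    def median_by_counting_merge(parcels, total):
        stream = heapq.merge(*parcels)           # stand-in for the 99-way merge over the network
        target = total // 2                      # need the (total/2)th and (total/2 + 1)th values
        for seen, value in enumerate(stream, start=1):
            if seen == target:
                lower = value
            elif seen == target + 1:
                return (lower + value) / 2       # nothing past this point is ever read

    machines = 99
    data = [random.randint(0, 10**6) for _ in range(10**5)]
    parcels = [sorted(data[i::machines]) for i in range(machines)]
    print(median_by_counting_merge(parcels, len(data)))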

Steve Jessop
+1  A: 

How about this: each node can take 1 billion/100 numbers. At each node the elements can be sorted and the median found. Find the median of those medians. Then, by aggregating the counts of numbers less than the median-of-medians on all nodes, we can find the x%:y% split which the median-of-medians makes. Now ask all nodes to delete elements less than the median of medians (taking the example of a 30%:70% split). 30% of the numbers are deleted. 70% of 1 billion is 700 million. Now all nodes which deleted fewer than 3 million numbers can send the extra numbers back to a main computer. The main computer redistributes them so that all nodes again have an almost equal number of elements (7 million). Now the problem is reduced to 700 million numbers... and this goes on until we have a set small enough to be computed on one computer.
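
A rough single-process sketch of this pruning loop (the "nodes" are plain lists; as a safeguard, each round only discards the side of the median-of-medians that cannot contain the target rank, which sidesteps the problem Rex Kerr raises in the comments below):

    # One process stands in for all 100 nodes. Each round computes the median of
    # the node medians, then throws away the side that cannot contain the k-th
    # value, adjusting k as described in the comments below.
    import random

    def median_of_node_medians(nodes):
        medians = [sorted(node)[len(node) // 2] for node in nodes]
        return sorted(medians)[len(medians) // 2]

    def kth_smallest(nodes, k):
        """0-based k-th smallest value across all nodes, by repeated pruning."""
        while sum(len(node) for node in nodes) > 10_000:   # until one machine can finish alone
            pivot = median_of_node_medians(nodes)
            below = sum(sum(1 for x in node if x < pivot) for node in nodes)
            equal = sum(sum(1 for x in node if x == pivot) for node in nodes)
            if k < below:                  # target is below the pivot: drop the upper side
                nodes = [[x for x in node if x < pivot] for node in nodes]
            elif k < below + equal:        # target is the pivot itself
                return pivot
            else:                          # safe to drop everything <= pivot, as in the answer
                nodes = [[x for x in node if x > pivot] for node in nodes]
                k -= below + equal
            nodes = [node for node in nodes if node]       # drop nodes that ran empty
        return sorted(x for node in nodes for x in node)[k]

    nodes = [[random.randint(0, 10**6) for _ in range(10_000)] for _ in range(100)]
    k = sum(len(node) for node in nodes) // 2              # upper middle value, 0-based
    print(kth_smallest(nodes, k))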

anony
In essence we are always reducing the problem set by at least 30% and we are achieving a lot of parallel computing through this. Each node starts with 10 million and reduces its data set by 30% in each iteration.
anony
In the first iteration we look for the 500 millionth number. In the second iteration, if the number of numbers deleted is 300 million, then we look for the 200 millionth number, and so on...
anony
This looks like it's on the right track, but you don't explain very clearly how to avoid throwing away the median by accident with your 30%/70% split. Take the following counterexample: suppose your first 29% is all zeros, and all other blocks count up by 1000, and each set of blocks is one more than the last. The 30th percentile median will throw away all of 29% of the data, and just under half of 61% of the data, which is 29+30% = 59% of the data. Oops, we just threw out the true median! So apparently you don't mean that, or at least you mean it more cleverly than I interpreted.
Rex Kerr
+1  A: 

Oddly enough, I think if you have enough computers, you're better off sorting than using O(n) median-finding algorithms. (Unless your cores are very, very slow, though, I'd just use one and use an O(n) median-finding algorithm for merely 1e9 numbers; if you had 1e12, though, that might be less practical.)

Anyway, let's suppose we have more than log n cores to deal with this problem, and we don't care about power consumption, just getting the answer fast. Let's further assume that this is an SMP machine with all the data already loaded in memory. (Sun's 32-core machines are of this type, for instance.)

One thread chops the list up blindly into equal sized pieces and tells the other M threads to sort them. Those threads diligently do so, in (n/M) log (n/M) time. They then return not only their medians, but, say, their 25th and 75th percentiles as well (perverse worst cases are better if you choose slightly different numbers). Now you have 4M ranges of data.

You then sort these ranges and work upwards through the list until you find a number such that, if you throw out every range that is smaller than or contains the number, you will have thrown out half your data. That's your lower bound for the median. Do the same for the upper bound. This takes something like M log M time, and all cores have to wait for it, so it's really wasting M^2 log M potential time.

Now you have your single thread tell the others to toss all data outside the range (you should throw out about half on each pass) and repeat--this is a trivially fast operation since the data is already sorted. You shouldn't have to repeat this more than log(n/M) times before it's faster to just grab the remaining data and use a standard O(n) median finder on it.

So, total complexity is something like O((n/M) log (n/M) + M^2 log M log (n/M)). Thus, this is faster than O(n) median sort on one core if M >> log(n/M) and M^3 log M < n, which is true for the scenario you've described.

I think this is a really bad idea given how inefficient it is, but it is faster.
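
Here's a rough sequential sketch of that range-pruning loop (the M "threads" are just sorted lists, the percentile choices and round count are arbitrary, and the finish is a plain sort of whatever survives rather than a tuned O(n) selector):

    # The M "threads" are sorted lists; each round keeps only the window [lo, hi]
    # that provably still contains the k-th value, then a direct selection
    # finishes on whatever is left.
    import bisect
    import random

    def count_below(chunks, v):          # values strictly less than v, across all chunks
        return sum(bisect.bisect_left(c, v) for c in chunks)

    def count_at_or_below(chunks, v):    # values less than or equal to v
        return sum(bisect.bisect_right(c, v) for c in chunks)

    def median_by_range_pruning(data, M=32, rounds=8):
        chunks = [sorted(data[i::M]) for i in range(M)]    # each "thread" sorts its piece
        k = len(data) // 2                                 # 0-based rank of the upper middle value
        for _ in range(rounds):
            # Every chunk reports its 25th/50th/75th percentile values.
            candidates = sorted({c[len(c) * p // 100] for c in chunks if c for p in (25, 50, 75)})
            lo_cands = [v for v in candidates if count_below(chunks, v) <= k]       # provably <= the k-th value
            hi_cands = [v for v in candidates if count_at_or_below(chunks, v) > k]  # provably >= the k-th value
            if not lo_cands or not hi_cands:
                break                                      # no usable bound this round; just finish
            lo, hi = max(lo_cands), min(hi_cands)
            k -= count_below(chunks, lo)                   # account for the values tossed below lo
            chunks = [c[bisect.bisect_left(c, lo):bisect.bisect_right(c, hi)] for c in chunks]
        return sorted(x for c in chunks for x in c)[k]     # the "standard median finder" step

    data = [random.randint(0, 10**6) for _ in range(10**5)]
    print(median_by_range_pruning(data))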

Rex Kerr
A: 

I hate to be the contrarian here, but I don't believe sorting is required, and I think any algorithm involving sorting a billion/100 numbers is going to be slow. Let's consider an algorithm on one computer.

1) Select 1000 values at random from the billion, and use them to get an idea of the distribution of the numbers, especially a range.

2) Instead of sorting the values, allocate them to buckets based on the distribution you just calculated. The number of buckets is chosen so that the computer can handle them efficiently, but should otherwise be as large as convenient. The bucket ranges should be chosen so that approximately equal numbers of values go in each bucket (this isn't critical to the algorithm, but it helps efficiency; 100,000 buckets might be appropriate). Note the number of values in each bucket. This is an O(n) process.

3) Find out which bucket range the median lies in. This can be done by simply examining the totals for each bucket.

4) Find the actual median by examining the values in that bucket. You can use a sort here if you like, since you are only sorting maybe 10,000 numbers.

This approach parallelizes trivially by dividing the values between the computers. Each computer reports the totals in each bucket to a 'control' computer which does step 3. For step 4 each computer sends the (sorted) values in the relevant bucket to the control computer (you can do both of those algorithms in parallel too, but it probably isn't worth it).

The total process is O(n), since both steps 3 and 4 are trivial, provided the number of buckets is large enough.
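
A minimal single-machine sketch of the whole flow (in the parallel version the per-bucket counts of step 2 would be summed at the control computer; the sample and bucket sizes here are arbitrary placeholders):

    # Bucket edges come from a random sample, counts locate the bucket holding
    # the median, and only that one bucket is ever sorted.
    import bisect
    import random

    def bucket_median(values, sample_size=1000, n_buckets=1000):
        # Step 1: sample to estimate the distribution; use the sample's quantiles
        # as bucket boundaries so the buckets get roughly equal counts.
        sample = sorted(random.sample(values, min(sample_size, len(values))))
        edges = [sample[i * len(sample) // n_buckets] for i in range(1, n_buckets)]

        # Step 2: count how many values fall into each bucket (O(n), no sorting).
        counts = [0] * n_buckets
        for v in values:
            counts[bisect.bisect_right(edges, v)] += 1

        # Step 3: walk the counts to find the bucket containing the median's rank.
        k = len(values) // 2                         # 0-based rank of the upper middle value
        seen = 0
        for b, c in enumerate(counts):
            if seen + c > k:
                break
            seen += c

        # Step 4: sort just that one bucket and pick the right element inside it.
        lo = edges[b - 1] if b > 0 else float("-inf")
        hi = edges[b] if b < n_buckets - 1 else float("inf")
        bucket = sorted(v for v in values if lo <= v < hi)
        return bucket[k - seen]

    data = [random.gauss(0, 1) for _ in range(10**6)]
    print(bucket_median(data))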

DJClayworth