One of my friend been asked with a question, Retrieving the max top 100 numbers from one hundred million of numbers, in a recent job interview. Do you have any idea to come up with an efficient way to solve it?
regards!
One of my friend been asked with a question, Retrieving the max top 100 numbers from one hundred million of numbers, in a recent job interview. Do you have any idea to come up with an efficient way to solve it?
regards!
Run them all through a min-heap of size 100: for each input number k
, replace the current min m
with max(k, m)
. Afterwards the heap holds the 100 largest inputs.
A search engine like Lucene can use this method, with refinements, to choose the most-relevant search answers.
Edit: I fail the interview -- I got the details wrong twice (after having done this before, in production). Here's code to check it; it's almost the same as Python's standard heapq.nlargest()
:
import heapq
def funnel(n, numbers):
if n == 0: return []
heap = numbers[:n]
heapq.heapify(heap)
for k in numbers[n:]:
if heap[0] < k:
heapq.heapreplace(heap, k)
return heap
>>> funnel(4, [3,1,4,1,5,9,2,6,5,3,5,8])
[5, 8, 6, 9]
By TOP 100
, do you mean 100 largest? If so:
SELECT TOP 100 Number FROM RidiculouslyLargeTable ORDER BY Number DESC
Make sure you tell the interviewer that you assume the table is indexed properly.
Mergesort in batches of 100, then only keep the top 100.
Incidentally, you can scale this in all sorts of directions, including concurrently.
Ok, here is a really stupid answer, but it is a valid one:
Reasoning:
It is an ok solution for some sort of one time operation. It would suck running it x times per second or something. But then, we need more context - as mclientk also had with his simple SQL statement - assuming 100 million numbersdo not exist in memory is a feasible question, because... they may come from a database and most of the time will, when talking about business relevant numbers.
As such, the question is really hard to answer - efficiency first has to be defined.
int numbers[100000000000] = {...};
int result[100] = {0};
for( int i = 0 ; i < 100000000000 ; i++ )
{
for( int j = 0 ; j < 100 ; j++ )
{
if( numbers[i] > result[j] )
{
if( j < 99 )
{
memcpy(result+j+1, result+j, (100-j)*sizeof(int));
}
result[j] = numbers[i];
break;
}
}
}
If the data is already in an array that you can modify, you could use a variant of Hoare's Select algorithm, which is (in turn) a variant of Quicksort.
The basic idea is pretty simple. In Quicksort, you partition the array into two pieces, one of items larger than the pivot, and the other of items smaller than the pivot. Then you recursively sort each partition.
In the Select algorithm, you do the partitioning step exactly as before -- but instead of recursively sorting both partitions, you look at which partition contains the elements you want, and recursively select ONLY in that partition. E.g., assuming your 100 million items partition nearly in half, the first several iterations you're going to look only at the upper partition.
Eventually, you're likely to reach a point where the portion you want "bridges" two partitions -- e.g., you have a partition of ~150 numbers, and when you partition that you end up with two pieces of ~75 apiece. At that point, only one minor detail changes: instead of rejecting one partition and continuing work only the other, you accept the upper partition of 75 items, and then continue looking for the top 25 in the lower partition.
If you were doing this in C++, you could do this with std::nth_element
(which will normally be implemented approximately as described above). On average, this has linear complexity, which I believe is about as good as you can hope for (absent some preexisting order, I don't see any way to find the top N elements without looking at all the elements).
There's no reason to sort the whole list. This should be doable in O(n) time. In pseudocode:
List top = new List
for each num in entireList
for i = 0 to top.Length
if num > top[i] then
top.InsertBefore(num, i)
if top.Length > 100 then
top.Remove(top.Length - 1)
end if
exit for
else
if i = top.Length - 1 and i < 100 then
top.Add(num)
end if
end if
next
next
I think probably the divide-and-concer method could work well.
or there is some good data structure to support?