views:

890

answers:

8

A friend of mine was asked the following question in a recent job interview: retrieve the top 100 largest numbers from a set of one hundred million numbers. Do you have any idea how to solve this efficiently?

regards!

+26  A: 

Run them all through a min-heap of size 100: for each input number k, replace the current min m with max(k, m). Afterwards the heap holds the 100 largest inputs.

A search engine like Lucene can use this method, with refinements, to choose the most-relevant search answers.

Edit: I'd fail the interview -- I got the details wrong twice (after having done this before, in production). Here's code to check it; it's almost the same as Python's standard heapq.nlargest():

import heapq

def funnel(n, numbers):
    if n == 0: return []
    heap = numbers[:n]
    heapq.heapify(heap)
    for k in numbers[n:]:
        if heap[0] < k:
            heapq.heapreplace(heap, k)
    return heap

>>> funnel(4, [3,1,4,1,5,9,2,6,5,3,5,8])
[5, 8, 6, 9]
Darius Bacon
+1: But only replace the min if the new number is >min, though... (or use a min-heap of 101 elements)
Reed Copsey
O(n*log(m)), not too shabby.
Sparr
Oops, quite right, Reed.
Darius Bacon
@Darius Bacon: shouldn't it be "replace the current min m with `max(k, m)`". Did I understand something wrong here?
Lazer
+1 An elegant solution since you leverage a very beautiful data structure properly. Well done.
gotgenes
+1  A: 

By TOP 100, do you mean 100 largest? If so:

SELECT TOP 100 Number FROM RidiculouslyLargeTable ORDER BY Number DESC

Make sure you tell the interviewer that you assume the table is indexed properly.

mcliedtk
Who said anything about a DB?
Reed Copsey
Who said it's not in a DB? With 100 million data points, assuming they are not just sitting in a text file is sensible - after all, this is a typical OLTP / data-warehouse type of query.
TomTom
Sorry, but this is not a question about SQL; assume the interviewer wants an algorithm, expressed as code or pseudo-code.
didxga
Yes, the question did not mention a database, though it did not mention any sort of data structure to begin with. In an actual interview, I would ask for clarification, but the question is vague enough to suggest that this answer is 'one' solution. I don't see how assuming the numbers are in an array, a file, a tree, and so forth is any different from assuming the presence of a database.
mcliedtk
If I asked that question in an interview and received that answer, I would probably say something like, "Great! Now supposing you are writing the RDBMS, how would you implement that query?"
hemp
Eric Lippert explained this situation very well in this blog entry (http://blogs.msdn.com/ericlippert/archive/2009/03/20/it-s-not-magic.aspx) where he said, "Where I start to see magical thinking is when I ask the candidate to assume that their chosen solution is simply not yet implemented on their platform."
hemp
@hemp, "Why in the world are you asking me about how to develop an RDBMS, when I'm applying as Perl web developer?"
Pavel Shved
Because even a lowly Perl web developer should be a solid enough engineer that they can reason intelligibly about how they might implement that query. I'm not expecting that you've done it before, or that you will ever do it in the future. I don't hire based on experience; I hire based on demonstrable intelligence and the ability to solve problems using basic computer science fundamentals.
hemp
+2  A: 

Mergesort in batches of 100, then only keep the top 100.

Incidentally, you can scale this in all sorts of directions, including concurrently.
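A quick sketch of this batched idea in Python (my illustration, not MSN's code; the batch size is arbitrary): merge each incoming batch with the running winners, then cut back down to the top 100. Because only one batch plus the current top 100 is ever in memory, the batches could just as well arrive from disk or from parallel workers.

```python
import heapq
from itertools import islice

def top_k_batched(numbers, batch_size=10000, k=100):
    """Keep only the k largest items, merging batch by batch."""
    it = iter(numbers)
    top = []  # running top-k so far
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        # merge the current batch with the running winners,
        # then cut back down to the k largest (sorted descending)
        top = heapq.nlargest(k, top + batch)
    return top
```

Each merge step touches at most `k + batch_size` items, so the working set stays small no matter how long the input stream is.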

MSN
+6  A: 

Ok, here is a really stupid answer, but it is a valid one:

  • Load all 100 million entries into an array
  • Call some quick sort implementation on it
  • Take last 100 items (it sorts ascending), or first 100 if you can sort descending.

Reasoning:

  • There is no context on the question, so efficiency can be argued - what IS efficient? Computer time or programmer time?
  • This method can be implemented very quickly.
  • 100 million numbers are just a couple of hundred MB, so every decent workstation can simply run that.

It is an OK solution for some sort of one-time operation. It would suck if run many times per second or something. But then, we need more context - as mcliedtk noted with his simple SQL statement, asking whether the 100 million numbers even exist in memory is fair, because... they may come from a database, and most of the time they will when we are talking about business-relevant numbers.

As such, the question is really hard to answer - efficiency first has to be defined.
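For what it's worth, the "just sort it" approach is essentially a one-liner in Python (a sketch, assuming the numbers fit in memory):

```python
def top_k_by_sorting(numbers, k=100):
    # O(n log n), but trivial to write and fast enough for a one-off job
    return sorted(numbers)[-k:]  # ascending; the last k are the largest
```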

TomTom
Yes, but as pointed out in my answer, (at least if you're using the right language) it's just as easy to do the job more efficiently.
Jerry Coffin
A: 
int numbers[100000000] = {...};  /* one hundred million, as in the question */
int result[100] = {0};           /* note: zero-init assumes the inputs are non-negative */
for( int i = 0 ; i < 100000000 ; i++ )
{
    for( int j = 0 ; j < 100 ; j++ )
    {
         if( numbers[i] > result[j] )
         {
              if( j < 99 )
              {
                  /* memmove, not memcpy: source and destination overlap,
                     and only 99-j elements fit without running past result[99] */
                  memmove(result+j+1, result+j, (99-j)*sizeof(int));
              }
              result[j] = numbers[i];
              break;
         }
    }
}
chris
+1  A: 

If the data is already in an array that you can modify, you could use a variant of Hoare's Select algorithm, which is (in turn) a variant of Quicksort.

The basic idea is pretty simple. In Quicksort, you partition the array into two pieces, one of items larger than the pivot, and the other of items smaller than the pivot. Then you recursively sort each partition.

In the Select algorithm, you do the partitioning step exactly as before -- but instead of recursively sorting both partitions, you look at which partition contains the elements you want, and recursively select ONLY in that partition. E.g., assuming your 100 million items partition nearly in half, the first several iterations you're going to look only at the upper partition.

Eventually, you're likely to reach a point where the portion you want "bridges" two partitions -- e.g., you have a partition of ~150 numbers, and when you partition that you end up with two pieces of ~75 apiece. At that point, only one minor detail changes: instead of rejecting one partition and continuing work on only the other, you accept the upper partition of 75 items, and then continue looking for the top 25 in the lower partition.

If you were doing this in C++, you could do this with std::nth_element (which will normally be implemented approximately as described above). On average, this has linear complexity, which I believe is about as good as you can hope for (absent some preexisting order, I don't see any way to find the top N elements without looking at all the elements).
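A rough Python sketch of the selection idea (my illustration, not Jerry's code; `std::nth_element` is the battle-tested C++ version, and production implementations add pivot-selection safeguards such as median-of-three):

```python
import random

def partition_select_top(a, k):
    """Rearrange list a in place so its last k slots hold the k largest
    elements (in no particular order), via quickselect-style partitioning."""
    lo, hi = 0, len(a) - 1
    target = len(a) - k          # index where the top-k region begins
    while lo < hi:
        pivot = a[random.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:            # Hoare-style partition around the pivot
            while a[i] < pivot: i += 1
            while a[j] > pivot: j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1; j -= 1
        # recurse (iteratively) only into the side containing `target`
        if target <= j:
            hi = j
        elif target >= i:
            lo = i
        else:
            break  # target lies in the middle run of pivot-equal elements
    return a[target:]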

Jerry Coffin
This takes O(n) time, but also O(n) space.
Darius Bacon
@Darius:Yes, that's quite true. Depending on the situation, that could be a huge problem, or none at all -- in particular, the question says you're given the numbers; if you're given them already in memory, then it makes about the most effective possible use of memory space -- virtually none extra at all. On the other hand, if you're given the data in a file (for example) allocating that much space could be a real problem (allocating virtual memory could take longer than your solution would to run, just for one example).
Jerry Coffin
A: 

There's no reason to sort the whole list. This should be doable in O(n) time. In pseudocode:

List top = new List

for each num in entireList
    for i = 0 to top.Length
        if num > top[i] then
            top.InsertBefore(num, i)
            if top.Length > 100 then
                top.Remove(top.Length - 1)
            end if
            exit for
        else
            if i = top.Length - 1 and i < 100 then
                top.Add(num)
            end if
        end if
    next
next
tloflin
For this to work, you would have to keep the "top" List sorted.
mbeckish
Yeah, fixed, I forgot about that part. Still O(n), I think.
tloflin
Your algorithm is only O(n) because you're asked to find the top 100, rather than the top `k`, where `k < n`. In the worst case scenario, the original list is sorted from smallest to largest, and so, for each number in the original list, you will always scan to the end of `top`. If instead you were asked to find the top `k`, your algorithm runs in O(kn), which may be poor for large `k` and `n`.
gotgenes
Also, `if i = top.Length - 1` should be `if i == top.Length - 1`.
gotgenes
"Your algorithm is only O(n) because you're asked to find the top 100." - That is correct. Of course if the problem were different it would require a different algorithm. But 100, compared to 100 million, is not an issue. Also, my code is pseudocode, so as long as you understood it, the syntax is irrelevant.
tloflin
A: 

I think the divide-and-conquer method could probably work well.

Or is there some good data structure that would support this?

ken