views: 1153

answers: 10

Hi all: I went to an interview today and here is the question:

Suppose you have 1 billion unsorted integers in a single disk file. Please find the largest 100 numbers; try the best solution as you can.

Anybody who has ideas, please do share!

Thanks in advance!

+7  A: 

I'd traverse the list in order. As I go, I'd add elements to a set (or multiset, depending on duplicates). Once the set reaches 100 elements, I'd only insert a value if it is greater than the min in the set (O(log m)), then delete the min.

Calling the number of values in the list n and the number of values to find m:

this is O(n * log m)
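
A minimal sketch of this in C++ (assuming, for illustration, raw 32-bit integers in a binary file; a text file would need different reading code):

#include <cstdint>
#include <fstream>
#include <set>
#include <vector>

// Sketch of the multiset idea: keep the 100 largest values seen so far,
// with the smallest of them at *top.begin().
std::vector<int32_t> top100(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    std::multiset<int32_t> top;
    int32_t value;
    while (in.read(reinterpret_cast<char*>(&value), sizeof value)) {
        if (top.size() < 100) {
            top.insert(value);                 // still filling the first 100
        } else if (value > *top.begin()) {
            top.erase(top.begin());            // drop the current minimum...
            top.insert(value);                 // ...and keep the larger value
        }
    }
    return std::vector<int32_t>(top.begin(), top.end());   // ascending order
}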

JoshD
This is not as efficient as mine, since you have to check the max of the set every time. Virtual -1.
Aliostad
Given a list of 500, 450, 105, would the number 480 be added in? I think you mean only insert the value if it is greater than the MIN of the set.
Tom Gullen
@Tom Gullen: quite right. Thanks for catching my meaning :)
JoshD
-1: you should add a new integer to the set if it is greater than the smallest integer in the set. Suppose that the size of the selected set is 4 and you have [5,7,9,12] and read the number 8 ...
High Performance Mark
@High Performance Mark: yes, I changed that a minute ago. Thanks.
JoshD
See my answer for a much faster way of insertion
Tom Gullen
@JoshD: and you have repeated your min/max confusion in your comment on @Aliostad's solution.
High Performance Mark
@High Performance Mark: Thanks for pointing that out. I've removed that and corrected my answer.
JoshD
@JoshD: Your solution, however, is at least O(n). Calling it "O(1000000000 * log 100)" would be equivalent to constant time complexity (O(1)).
Novox
@Novox: I've clarified the answer. Thanks for the feedback.
JoshD
+8  A: 

Create an array of 100 numbers all being -2^31.

Check if the first number you read from disk is greater than the first in the list. If it is, copy the array down one index and set the first entry to the new number. If not, check the next of the 100, and so on.

When you've finished reading all 1 billion integers you should have the highest 100 in the array.

Job done.
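
A rough C++ sketch of this, assuming raw 32-bit integers in a binary file (the array ends up kept in descending order and each value is compared from the top down, as described):

#include <cstdint>
#include <cstring>
#include <fstream>

// Sketch of the fixed-array approach: out[] starts out "all being -2^31"
// and stays sorted in descending order.
void top100(const char* path, int32_t out[100])
{
    for (int i = 0; i < 100; ++i)
        out[i] = INT32_MIN;

    std::ifstream in(path, std::ios::binary);
    int32_t value;
    while (in.read(reinterpret_cast<char*>(&value), sizeof value)) {
        for (int i = 0; i < 100; ++i) {
            if (value > out[i]) {
                // copy the array down one index from here, dropping out[99]
                if (i < 99)
                    std::memmove(&out[i + 1], &out[i], (99 - i) * sizeof(int32_t));
                out[i] = value;
                break;
            }
        }
    }
}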

Goz
+1 for being exactly like mine!
Aliostad
Very nice. Less memory than a set, but slightly longer time than mine, right? You could probably use a binary search to find the insertion point for the new elements to speed it up a bit. +1
JoshD
@Josh: Yeah, binary searching your insertion point would definitely be a plan. I'm not entirely sure that it would take more time than yours, though. It is more "complex" but I suspect it'd run at least as fast due to caching more nicely :) Would be interesting to try implementing both and see which runs faster. Of course, the biggest slowdown will be the disk reads, so optimisation-wise you'd be best off loading blocks of integers asynchronously in a double-buffered fashion ;)
Goz
@Goz: yeah, I'd get killed by the caching issue. But as you said, the I/O would dwarf the computation times.
JoshD
To use binary search to find the 'insertion point' implies keeping the array in order, in which case your insertions require copying (potentially) all the elements in the array (99 operations per item in the file if they are strictly ascending in value). You might be better off with a fixed-size binary heap instead, which has O(log N) insertion, at least with N=100. (If N is small enough, arrays are fine, or at least I've never got binary heap code to run as fast as a linear scan for N<=16, but for N=100 it probably will be better.)
Pete Kirkham
Keeping the array sorted, and using binary search has an extra advantage: you compare only once with the minimum of the 100 numbers you've got to eliminate a number that wouldn't make it to the top 100. So instead of 100 comparisons per number, you do only one. When there are a billion numbers to test, the 99 comparisons you don't do would add up quickly. :)
Dysaster
I'd start by comparing to the SMALLEST value rather than the largest, as if it's less than the smallest, we can immediately discard and move along. Also, my initial impulse would be to use a linked list rather than an array to avoid all the data movement, but this is probably a trivial performance difference, as disk I/O is going to be the bottleneck.
Jay
+3  A: 

Keep a fixed array of 100 integers, initialised to Int.MinValue. As you read through the 1 billion integers, compare each one with the value in the first cell of the array (index 0). If it is larger, move up to the next cell, and keep moving up until you hit the end of the array or a value larger than the one you read. Then store the new value in the cell just below where you stopped and shift all values in the cells below it one cell down (dropping the smallest)... do this and you will end up with the 100 largest integers.

Aliostad
@Aliostad: Every time you read a new int, your algorithm needs to traverse the array, doing quite a lot of comparisons (ranging from 1 to 100), which costs considerable time when the total count here is 1 billion.
Tracy
@Tracy: There is no other way. You have to compare with the ints in the array, starting from the smallest. Someone suggested binary search, which I believe will improve it.
Aliostad
+7  A: 

Here's my initial algorithm:

create array of size 100 [0..99].
read first 100 numbers and put into array.
sort array in ascending order.
while more numbers in file:
    get next number N.
    if N > array[0]:
        if N > array[99]:
            shift array[1..99] to array[0..98].
            set array[99] to N.
        else
            find, using binary search, first index i where N <= array[i].
            shift array[1..i-1] to array[0..i-2].
            set array[i-1] to N.
        endif
    endif
endwhile

This has the (very slight) advantage that there's no O(n^2) shuffling for the first 100 elements, just an O(n log n) sort, and that you very quickly identify and throw away values that are too small. It also uses a binary search (7 comparisons max) to find the correct insertion point rather than 50 (on average) for a simplistic linear search (not that I'm suggesting anyone else proffered such a solution, just that it may impress the interviewer).

You may even get bonus points for suggesting the use of optimised block-move operations like memmove in C (memcpy isn't safe here, since the source and destination ranges overlap).
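
A rough C++ sketch of that insertion step, assuming the current 100 values sit in an ascending array a[] (std::lower_bound does the binary search, memmove the overlapping shift):

#include <algorithm>
#include <cstdint>
#include <cstring>

// Sketch of the per-number step: a[] holds the current top 100 in ascending
// order, n is the next number read from the file.
inline void maybe_insert(int32_t a[100], int32_t n)
{
    if (n <= a[0])
        return;                                // too small: throw it away
    if (n > a[99]) {
        std::memmove(&a[0], &a[1], 99 * sizeof(int32_t));   // drop a[0]
        a[99] = n;
        return;
    }
    // first index i where n <= a[i]; at most 7 comparisons
    auto i = std::lower_bound(a, a + 100, n) - a;
    std::memmove(&a[0], &a[1], (i - 1) * sizeof(int32_t));  // shift a[1..i-1] down
    a[i - 1] = n;
}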


One other possibility you may want to consider is to maintain three lists (of up to 100 integers each):

read first hundred numbers into array 1 and sort them descending.
while more numbers:
    read up to next hundred numbers into array 2 and sort them descending.
    merge-sort lists 1 and 2 into list 3 (only first (largest) 100 numbers).
    if more numbers:
        read up to next hundred numbers into array 2 and sort them descending.
        merge-sort lists 3 and 2 into list 1 (only first (largest) 100 numbers).
    else
        copy list 3 to list 1.
    endif
endwhile

I'm not sure, but that may end up being more efficient than the continual shuffling.

The merge-sort is a simple selection along the lines of (for merge-sorting lists 1 and 2 into 3):

list3.clear()
while list3.size() < 100:
    if list2.empty() or (not list1.empty() and list1.peek() >= list2.peek()):
        list3.add(list1.pop())
    else:
        list3.add(list2.pop())
    endif
endwhile

Simply put, this pulls the top 100 values out of the combined list by virtue of the fact that they're already sorted in descending order. I haven't checked in detail whether that would be more efficient; I'm just offering it as a possibility.

I suspect the interviewers would be impressed with the potential for "out of the box" thinking and the fact that you'd stated that it should be evaluated for performance.

As with most interviews, technical skill is one of the things they're looking at.

paxdiablo
Wouldn't the constant sorting be an unnecessary task? Why bother sorting when all you have to do is a simple comparison? The best-case scenario for any sort is O(n), right? Which is when the set is already sorted. But a comparison of each element, remembering the lowest value of the highest values, would be O(n-1).
Mike Bethany
You may be right, @Pickle, that's why I stated I hadn't done a proper analysis - you have to compare the operations rather than the big-O properties; it may be that 10 million of the sort and merge operations are more efficient than the potential 10 million 100-value shuffles involved in my original answer. By the way, if you _find_ an O(n) sort, let me know, I'll buy the rights off you :-)
paxdiablo
I wrote out some of the logic and it looks like you have to at least do a bubble sort on the 100 (assuming they're sorted) after seeing if a number is larger than your smallest largest number. You pop your buffer of 100 then do a bubble sort to find the appropriate position for the new number.
Mike Bethany
RE: O(n) sort. A radix sort would work for this problem, but the actual performance might be worse than the other options. @PP: O(n-1)=O(n). Best case isn't always O(n), and an already sorted sequence isn't always the best case. For example, best case for quicksort is O(n log n), and the timing for quicksorting an already sorted sequence is O(n**2).
outis
@outis "Best case not always O(n)" Yeah, my meaning was the best **possible** case for the set of all sorts that would do no additional work if the target set is already sorted. I thought that was pretty clear. ;) And yeah also on the O(n-1) thing. I meant O(n) - O(2) if that makes any sense.
Mike Bethany
Nope, that makes no sense @me. And I did mean O(n-1) but the minus is irrelevant as it scales so that's why O(n) = O(n-1). Gotcha.
Mike Bethany
+1 Inserting into an array is the fastest in practice. According to my Java benchmark it's faster than using `TreeSet`, `BitSet` or `PriorityQueue` (which happens to be the slowest). The numbers are generated by `Random`.
starblue
+5  A: 

Speed of the processing algorithm is absolutely irrelevant (unless it's completely dumb).

The bottleneck here is I/O (it's specified that they are on disk). So make sure that you work with large buffers.
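
For example, a rough sketch of chunked reading in C++ (assuming raw 32-bit integers in a binary file):

#include <cstdint>
#include <cstdio>
#include <vector>

// Sketch of buffered reading: pull the integers off disk in ~4 MB chunks so
// the run is dominated by sequential I/O rather than per-value read calls.
void scan_file(const char* path)
{
    std::FILE* f = std::fopen(path, "rb");
    if (!f)
        return;
    std::vector<int32_t> buf(1 << 20);          // 2^20 integers = 4 MB per read
    long long count = 0;
    std::size_t got;
    while ((got = std::fread(buf.data(), sizeof(int32_t), buf.size(), f)) > 0) {
        // placeholder: feed buf[0] .. buf[got - 1] to your top-100 structure
        count += static_cast<long long>(got);
    }
    std::printf("read %lld integers\n", count);
    std::fclose(f);
}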

ruslik
@ruslik: yeah, you are right; in practice, I/O is the major bottleneck for this question, but for now let us just focus on the algorithm. Thanks
Tracy
@Tracy if the numbers are random, and there are many of them, the top 100 results will converge rapidly to high values, so that most new values will be smaller than them. At this point the insertion cost of the chosen structure won't matter anymore. All that will matter is the `getNextInt()`. Sure, it's O(1), but don't forget the hidden constant! I've seen many solutions here that completely ignore it.
ruslik
+1  A: 

You are going to have to check every number; there is no way around that.

Just as a slight improvement on the solutions offered:

Given a list of 100 numbers:

9595
8505
...
234
1

You would check to see if the newly found value is greater than the min value of our array; if it is, insert it. However, doing a search from bottom to top can be quite expensive, and you may consider taking a divide-and-conquer approach: for example, evaluate the 50th item in the array and do a comparison, and then you know whether the value needs to be inserted in the first 50 items or the bottom 50. You can repeat this process for a much faster search, as each step eliminates 50% of the search space.

Also consider the data type of the integers. If they are 32-bit integers and you are on a 64-bit system, you may be able to do some clever memory handling and bitwise operations to deal with two numbers from disk at once if they are contiguous in memory.

Tom Gullen
Note that searching for the insertion point is a binary search. As such, this is basically the same as paxdiablo's answer.
outis
A: 

If you find the 100th order statistic using quicksort-style partitioning (quickselect), it will work in average O(billion). But I doubt that, with numbers of this size and the random access this approach needs, it will be faster than O(billion * log(100)).
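
For illustration only, if the data did fit in memory (which is exactly the catch here), the standard library's selection algorithm does this partitioning directly; this is a sketch, not a recommendation for a billion on-disk values:

#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// Sketch of the selection approach: std::nth_element partitions the data so
// that the 100 largest values end up in the first 100 slots, in average
// linear time. Assumes data.size() >= 100 and the whole set in memory.
std::vector<int32_t> top100(std::vector<int32_t> data)
{
    std::nth_element(data.begin(), data.begin() + 100, data.end(),
                     std::greater<int32_t>());
    return std::vector<int32_t>(data.begin(), data.begin() + 100);
}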

Sergey Bankevich
+20  A: 

Obviously the interviewers want you to point out two key facts:

  • You cannot read the whole list of integers into memory, since it is too large. So you will have to read them one by one.
  • You need an efficient data structure to hold the 100 largest elements. This data structure must support the following operations:
    • Get-Size: Get the number of values in the container.
    • Find-Min: Get the smallest value.
    • Delete-Min: Remove the smallest value to replace it with a new, larger value.
    • Insert: Insert another element into the container.

By evaluating the requirements for the data structure, a computer science professor would expect you to recommend using a Heap (Min-Heap), since it is designed to support exactly the operations we need here.

For example, for Fibonacci heaps, the operations Get-Size, Find-Min and Insert are all O(1) and Delete-Min is O(log n) (with n <= 100 in this case).

In practice, you could use a priority queue from your favorite language's standard library (e.g. priority_queue from #include <queue> in C++), which is usually implemented using a heap.
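
A minimal sketch along those lines in C++ (assuming raw 32-bit integers in a binary file; std::greater turns priority_queue into the min-heap described above):

#include <cstdint>
#include <fstream>
#include <functional>
#include <queue>
#include <vector>

// Sketch of the min-heap approach: the heap holds the 100 largest values seen
// so far, with the smallest of them on top for an O(log 100) replacement.
std::vector<int32_t> top100(const char* path)
{
    std::priority_queue<int32_t, std::vector<int32_t>,
                        std::greater<int32_t>> heap;        // min-heap
    std::ifstream in(path, std::ios::binary);
    int32_t value;
    while (in.read(reinterpret_cast<char*>(&value), sizeof value)) {
        if (heap.size() < 100) {
            heap.push(value);                  // Insert: still filling up
        } else if (value > heap.top()) {       // Find-Min
            heap.pop();                        // Delete-Min
            heap.push(value);                  // Insert the larger value
        }
    }
    std::vector<int32_t> result;
    while (!heap.empty()) {                    // drain in ascending order
        result.push_back(heap.top());
        heap.pop();
    }
    return result;
}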

Ferdinand Beyer
Actually, in practice, you could use an array and record the current min value in a separate variable, because the time taken to copy numbers around when you want to insert into the array will probably be orders of magnitude less than reading 4 GB (assuming the file is binary and consists of a packed list of 32-bit integers with no delimiters) off disk.
JeremyP
@Jeremy: what about reading the first 100 ints and sorting them, then reading from the 101st int to the end one by one, using binary search to insert each one into the proper position? Should this be faster?
Tracy
My point was, it is not all that important because you have a minimum of 4 GB of disk blocks to get through. That is a lot of I/O.
JeremyP
@Tracy: which is faster in practice depends on (among other things) how fast the [blit](http://www.catb.org/~esr/jargon/html/B/blit.html) operation is, since inserting into an array will require shifting all items before it. In terms of timing, deleting-and-inserting into an array is O(n), while deleting-and-inserting into a heap is O(log n) (actually, the delete-and-insert could be combined into a replace-and-sift-down, though that's still O(log n)).
outis
@Jeremy: You are right, the IO will really be the bottleneck in practice. Nevertheless, it could be easier and less error-prone to use an existing heap implementation than writing your own array-based logic.
Ferdinand Beyer
Moreover, this is a theoretic problem and the question says **try the best solution as you can**.
Ferdinand Beyer
You can use a Bit Map to store the numbers in memory using only ~500MB of data. This leads to a neat solution - see below (or vote it up and see above!)
Howard May
@Ferdinand Beyer: Agreed if the heap exists in an easy to get hold of library or even natively. By the way, the *best* solution does not necessarily mean "use the mathematically best algorithm". It might mean "use the algorithm that you can write in the next hour with a reasonable chance of being bug free".
JeremyP
+2  A: 

I believe the quickest way to do this is by using a very large bit map to record which numbers are present. To cover every possible 32-bit integer this would need 2^32 / 8 bytes, which is about 536MB. Scan through the integers, simply setting the corresponding bit in the bit map, then look for the highest 100 set bits.

NOTE: This finds the highest 100 distinct numbers, not the highest 100 instances of a number, if you see the difference.

This kind of approach is discussed in the very good book Programming Pearls which your interviewer may have read!
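
A rough C++ sketch of the idea, assuming the values are read as unsigned 32-bit integers so that bit position and numeric order coincide (signed input would first need an offset of 2^31):

#include <cstdint>
#include <fstream>
#include <vector>

// Sketch of the bit-map approach: one bit per possible 32-bit value,
// 2^32 bits in total, i.e. the ~536 MB (512 MiB) mentioned above.
// Remember this yields the 100 largest *distinct* values.
std::vector<uint32_t> top100(const char* path)
{
    std::vector<uint64_t> bits(1ULL << 26, 0);   // 2^32 bits / 64 bits per word
    std::ifstream in(path, std::ios::binary);
    uint32_t v;
    while (in.read(reinterpret_cast<char*>(&v), sizeof v))
        bits[v >> 6] |= 1ULL << (v & 63);        // mark this value as present

    std::vector<uint32_t> result;                // scan from the top down
    for (auto w = bits.size(); w > 0 && result.size() < 100; --w)
        for (int b = 63; b >= 0 && result.size() < 100; --b)
            if (bits[w - 1] & (1ULL << b))
                result.push_back(static_cast<uint32_t>((w - 1) * 64 + b));
    return result;
}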

Howard May
@Howard: I see what you mean, but it costs 536MB of memory; plus, the scanning time is surely unacceptable, right?
Tracy
@Tracy: 536MB is fine for a modern PC. Unless the problem statement constrains the memory you can use, I don't think it's an issue. This solution only requires you to scan the file once, so no, there isn't an issue with scanning time.
Howard May
A: 

I think someone should have mentioned a priority queue by now. You just need to keep the current top 100 numbers, know what the lowest is and be able to replace that with a higher number. That's what a priority queue does for you - some implementations may sort the list, but it's not required.

Paul
they did, 2 answers ago
Woot4Moo
Whoops. Missed the comments before. Thanks for the pointer
Paul