There is a file that contains 10G (1000000000) integers. Please find the median of these integers; you are given 2G of memory to do this. Can anyone come up with a reasonable way? Thanks!

+5  A: 

If the file is in text format, you may be able to fit it in memory just by converting things to integers as you read them in, since an integer stored as characters may take more space than an integer stored as an integer, depending on the size of the integers and the type of text file. EDIT: You've edited your original question; I can see now that you can't read them into memory, so see below.

If you can't read them into memory, this is what I came up with:

  1. Figure out how many integers you have. You may know this from the start. If not, then it only takes one pass through the file. Let's say this is S.

  2. Use your 2G of memory to find the x largest integers (however many you can fit). You can do one pass through the file, keeping the x largest in a sorted list of some sort, discarding the rest as you go. Now you know the x-th largest integer. You can discard all of these except for the x-th largest, which I'll call x1.

  3. Do another pass through, finding the next x largest integers less than x1, the least of which is x2.

  4. I think you can see where I'm going with this. After a few passes, you will have read in the (S/2)-th largest integer (you'll have to keep track of how many integers you've found), which is your median. If S is even then you'll average the two in the middle.
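
A minimal Scala sketch of one such pass, assuming a hypothetical readAll() that re-streams the integers from the file (the function name and the heap-based bookkeeping are illustrative assumptions, not part of the answer above):

import scala.collection.mutable.PriorityQueue

// Sketch only: readAll() is assumed to re-read the whole file as an Iterator[Int].
// upperBound is Long.MaxValue on the first pass, then the previous pass's boundary (x1, x2, ...).
def largestBelow(readAll: () => Iterator[Int], x: Int, upperBound: Long): Array[Int] = {
  // min-heap holding the x largest values seen so far that are strictly less than upperBound
  val heap = PriorityQueue.empty[Int](Ordering[Int].reverse)
  for (n <- readAll() if n < upperBound) {
    if (heap.size < x) heap.enqueue(n)
    else if (n > heap.head) { heap.dequeue(); heap.enqueue(n) }
  }
  heap.dequeueAll.toArray   // ascending; the first element is this pass's boundary
}

On the first pass you would call this with upperBound = Long.MaxValue to obtain x1, on the next pass with upperBound = x1 to obtain x2, and so on, counting how many values you have covered until you reach the (S/2)-th largest. As the comments below point out, duplicates of the boundary value need extra bookkeeping (e.g. counting how many times x1 appears).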

ajduff574
@ajduff574: as I have updated my question, there are 10G (1000000000) integers
didxga
+1 Impressive. But I foresee a problem dealing with large amounts of repeated numbers. Imagine that you are halfway through a pass through the file and your 2G sorted memory array fills up. Throughout the second half of your pass, you do not encounter any numbers that enable you to evict elements from your 2G array, but you encounter a lot of numbers that are exactly equal to the smallest element in the array (x1). You reach the end, discard your list, begin the next pass, and realize that you don't know which of the x1's you have discarded previously and which you have not.
advait
A possible solution might be not to use x1 in this case and use x1+1, thereby discarding everything in your 2G array that is greater than or equal to x1+1. However, you might reach a point where your entire 2G array becomes homogeneous. What then? You can't discard any numbers!
advait
Good catch! I think you could also keep track of how many times the smallest number (x) has appeared in your array. If it appears n times, then on the next pass you find the largest numbers less than or equal to x, discarding the first n that you find that are equal to x. If it appears only once, this is the same as above. Even if the whole list is the same, this should still work.
ajduff574
This has to be repeated n/(2k) times in the worst case. So if you use a sorted-list implementation that can do search, insert and delete in O(log n), the time complexity is O((n²/k) log k), where k is the number of integers in the sorted list. That means if k comes close to n it tends towards O(n log n), and if k is a lot smaller than n it tends towards O(n²). In the given case k is close to n/2, so it's nearly O(n log n). The Median of Medians algorithm is O(n), but it may need to save the partitions to disk, so I think this should probably be faster. +1
rudi-moore
... If k is a lot smaller than n, there is a modification which will make it fast too. For every number in the sorted list also save a count. If the sorted list is completely filled, don't delete the smallest. Instead, compare the count of the smallest and 2nd smallest elements with the count of the largest and 2nd largest elements. If the count of the two smallest elements is less, delete the 2nd smallest element and add its count to the smallest element; else do the same for the largest element. ...
rudi-moore
... After the pass, find the median from that sorted list (paying attention to the counts). Take the numbers from the elements before and after that median. Start a new pass and only consider numbers between those two numbers (excluding them). Repeat until only one number is found. I think the worst case for that is O(n log n log k) and the average case should be O(n log k).
rudi-moore
A: 

Make a pass through the file and find the count of integers and the minimum and maximum integer values.

Take the midpoint of min and max, and get the count, min and max for the values on either side of that midpoint, again by reading through the file.

If the lower partition's count is at least count/2, the median lies within the lower partition; otherwise it lies within the upper one.

Repeat for that partition, taking into account the size of the 'partitions to the left' (easy to maintain), and also watching for min = max.

I'm sure this would work for an arbitrary number of partitions as well.
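
A minimal Scala sketch of this counting/bisection idea (readAll() is a hypothetical helper that re-streams the integers from the file on every pass):

// Sketch only: find the k-th smallest value (1-based) by bisecting the value range
// and counting, on each pass, how many of the remaining values fall in the lower half.
def kthSmallest(readAll: () => Iterator[Int], k: Long): Int = {
  var lo = Int.MinValue.toLong
  var hi = Int.MaxValue.toLong
  var rank = k                                 // rank still needed within [lo, hi]
  while (lo < hi) {
    val mid = lo + (hi - lo) / 2
    var lowerCount = 0L
    for (n <- readAll() if n >= lo && n <= hi) {
      if (n <= mid) lowerCount += 1
    }
    if (rank <= lowerCount) hi = mid           // target lies in [lo, mid]
    else { rank -= lowerCount; lo = mid + 1 }  // target lies in [mid+1, hi]
  }
  lo.toInt
}

The median is then kthSmallest(readAll, (count + 1) / 2), averaged with kthSmallest(readAll, count / 2 + 1) when the count is even. With 32-bit integers this needs at most about 32 passes over the file and only a handful of variables in memory.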

Will A
+3  A: 
  1. Do an on-disk external mergesort on the file to sort the integers (counting them if that's not already known).
  2. Once the file is sorted, seek to the middle number (odd case), or average the two middle numbers (even case) in the file to get the median (a small sketch of this step is shown below).

The amount of memory used is adjustable and unaffected by the number of integers in the original file. One caveat of the external sort is that the intermediate sorting data needs to be written to disk.

Given n = number of integers in the original file:

  • Running time: O(n log n)
  • Memory: O(1), adjustable
  • Disk: O(n)
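
A small Scala sketch of step 2 above, assuming the external sort has written the result as 4-byte big-endian integers to a binary file (the path and record layout are assumptions for illustration):

import java.io.RandomAccessFile

// Sketch only: read the median straight out of an already-sorted binary file
// of 4-byte big-endian ints by seeking to the middle record(s).
def medianOfSortedIntFile(path: String): Double = {
  val raf = new RandomAccessFile(path, "r")
  try {
    val n = raf.length() / 4                                      // number of stored ints
    def intAt(i: Long): Int = { raf.seek(i * 4); raf.readInt() }
    if (n % 2 == 1) intAt(n / 2).toDouble                         // odd: single middle value
    else (intAt(n / 2 - 1).toLong + intAt(n / 2).toLong) / 2.0    // even: average the two
  } finally raf.close()
}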
Chris Schmich
+7  A: 

You can use the Medians of Medians algorithm.
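
For illustration only, here is a minimal in-memory Scala sketch of selection with a median-of-medians pivot; the 10G case would additionally need to stream and partition the data on disk, which this sketch does not attempt:

// Sketch only: k-th smallest element (0-based) via quickselect with a
// median-of-medians pivot, which gives O(n) worst-case time in memory.
def select(xs: Vector[Int], k: Int): Int =
  if (xs.length <= 5) xs.sorted.apply(k)
  else {
    // median of each group of 5, then recursively the median of those medians as pivot
    val medians = xs.grouped(5).map(g => g.sorted.apply(g.length / 2)).toVector
    val pivot = select(medians, medians.length / 2)
    val (below, rest) = xs.partition(_ < pivot)
    val (equal, above) = rest.partition(_ == pivot)
    if (k < below.length) select(below, k)
    else if (k < below.length + equal.length) pivot
    else select(above, k - below.length - equal.length)
  }

// e.g. the exact median of an odd-length collection: select(values, values.length / 2)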

starblue
+1, the factor of 5 difference between 10G and 2G sounds like this is the expected answer.
Ants Aasma
@Ants Aasma, 10G integers is usually 40GB, which is 20x the 2GB of RAM. However, Medians of Medians can still work.
grokus
Ah yeah, so it is. I originally misread that as 10GB of integers.
Ants Aasma
It should be noted that this gives only an **approximation** of the median. Taking the list in the Wikipedia example and sorting it, the middle values are 49 and 50 (so you could define the median to be 49.5) and NOT 47 as highlighted in the example there.
Andre Holzner
@Andre Holzner, it is a selection algorithm that uses an approximation of the median to find, for example, the exact median in O(n). The example on Wikipedia is about how to find that approximation, the pivot for the pseudocode above it.
Ishtar
A: 

Check out Torben's method here: http://ndevilla.free.fr/median/median/index.html. It also has an implementation in C at the bottom of the document.

spbfox
A: 

My best guess is that a probabilistic median of medians would be the fastest. Recipe:

  1. Take the next set of N integers (N should be big enough, say 1000 or 10000 elements).
  2. Then calculate the median of these integers and assign it to the variable X_new.
  3. If this is not the first iteration, calculate the median of the two medians:

    X_global = (X_global + X_new) / 2

  4. When you see that X_global no longer fluctuates much, you have found an approximate median of the data.

But there are some notes:

  • The question arises whether the error in the median is acceptable or not.
  • The integers must be distributed randomly in a uniform way for this solution to work.

EDIT: I've played a bit with this algorithm and changed the idea slightly: in each iteration we should blend in X_new with a decreasing weight, such as:

X_global = k*X_global + (1.-k)*X_new

where k is in [0.5 .. 1.0] and increases with each iteration.

The point is to make the median calculation converge quickly to some value in a very small number of iterations, so that a very approximate median (with a large error) is found among 100,000,000 array elements in only 252 iterations. Check this C experiment:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define ARRAY_SIZE 100000000
#define RANGE_SIZE 1000

/* Probabilistic "median of medians" experiment:
   each chunk of RANGE_SIZE random values in [1, 10000] is averaged into X_new,
   and X_new is blended into X_global with a weight k that grows toward 1,
   so the estimate should converge to about 5000. */
int main (int argc, const char * argv[]) {
    int iter = 0;
    int X_global = 0;   /* running estimate */
    int X_new = 0;      /* estimate from the current chunk */
    int i = 0;
    float dk = 0.002f;  /* per-iteration increase of the blending weight */
    float k = 0.5f;
    srand(time(NULL));

    while (i < ARRAY_SIZE && k != 1.f) {
        /* "read" the next chunk of RANGE_SIZE values and take its average
           (note: this is the chunk mean, not its median -- see the remark below) */
        X_new = 0;
        for (int j = i; j < i + RANGE_SIZE; j++) {
            X_new += rand() % 10000 + 1;
        }
        X_new /= RANGE_SIZE;

        if (iter > 0) {
            /* blend the new chunk estimate in with decreasing influence */
            k += dk;
            k = (k > 1.f) ? 1.f : k;
            X_global = k * X_global + (1.f - k) * X_new;
        }
        else {
            X_global = X_new;
        }

        i += RANGE_SIZE;
        iter++;
        printf("iter %d, median = %d \n", iter, X_global);
    }

    return 0;
}

Oops, it seems I'm talking about the mean, not the median. If that's the case and you need exactly the median, not the mean, then ignore my post. In any case, the mean and the median are closely related concepts.

Good luck.

0x69
+7  A: 

Create an array of 8-byte longs that has 2^16 entries. Take your input numbers, shift off the bottom sixteen bits, and create a histogram.

Now you count up in that histogram until you reach the bin that covers the midpoint of the values.

Pass through again, ignoring all numbers that don't have that same set of top bits, and make a histogram of the bottom bits.

Count up through that histogram until you reach the bin that covers the midpoint of the (entire list of) values.

Now you know the median, in O(n) time and O(1) space (in practice, under 1 MB).

Here's some sample Scala code that does this:

def medianFinder(numbers: Iterable[Int]) = {
  def midArgMid(a: Array[Long], mid: Long) = {
    val cuml = a.scanLeft(0L)(_ + _).drop(1)
    cuml.zipWithIndex.dropWhile(_._1 < mid).head
  }
  val topHistogram = new Array[Long](65536)
  var count = 0L
  numbers.foreach(number => {
    count += 1
    topHistogram(number>>>16) += 1
  })
  val (topCount,topIndex) = midArgMid(topHistogram, (count+1)/2)
  val botHistogram = new Array[Long](65536)
  numbers.foreach(number => {
    if ((number>>>16) == topIndex) botHistogram(number & 0xFFFF) += 1
  })
  val (botCount,botIndex) =
    midArgMid(botHistogram, (count+1)/2 - (topCount-topHistogram(topIndex)))
  (topIndex<<16) + botIndex
}

and here it is working on a small set of input data:

scala> medianFinder(List(1,123,12345,1234567,123456789))
res18: Int = 12345

If you have 64-bit integers stored, you can use the same strategy in 4 passes instead.

Rex Kerr
Smart. I like it.
Patrick
Nice! This one is better than mine, and with code!
ajduff574
The complexities are really impressive. Kind of reminds me of the radix sort / bucket sort ideas.
Matthieu M.