views: 1093
answers: 5
I'm writing a program that needs to find the Nth largest value in a group of numbers. These numbers are generated by the program, but I don't have enough memory to store N numbers. Is there a better upper bound than N that can be achieved for storage? The upper bound on the size of the group of numbers (and on N) is approximately 100,000,000.

Note: The numbers are decimals and the list can include duplicates.

[Edit]: My memory limit is 16 MB.

A: 

If I understood correctly, the upper bound on your program's memory usage is O(N) (possibly N+1). You can maintain a list of the generated values that are greater than the current X (the Nth largest value so far), ordered smallest first. As soon as a value greater than X is generated, replace X with the first element of the list and insert the newly generated value at its corresponding position in the list.
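A minimal Python sketch of this bounded top-N scheme, using a min-heap instead of a manually sorted list (the storage bound is the same O(N); the function name is illustrative):

```python
import heapq

def nth_largest(stream, n):
    """Keep only the n largest values seen so far in a min-heap.

    The heap's root is always the nth largest value so far, so
    memory use is O(n) regardless of how long the stream is.
    """
    heap = []  # min-heap holding the n largest values seen
    for x in stream:
        if len(heap) < n:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            heapq.heapreplace(heap, x)  # drop the smallest, keep x
    return heap[0]  # the nth largest value

print(nth_largest(iter([5.0, 1.5, 9.2, 3.3, 7.7]), 2))  # → 7.7
```

`heapreplace` does the pop-and-insert in one step, which matches the "replace X and insert the new value" description above.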

freitass
I did that using a min-heap priority queue, but I ran out of memory.
snake
Then you don't have sufficient memory to solve your problem. I don't think you can do better than storing the top N numbers.
PanCrit
A: 

sort -rn | uniq -c and the Nth largest should be the Nth row

LarsOn
My memory constraints prevent storing N or more numbers...
snake
+1  A: 

This is similar to another question -- C Program to search n-th smallest element in array without sorting? -- where you may get some answers.
The logic will work for Nth largest/smallest search similarly.
Note: I am not saying this is a duplicate of that.


Since you have a lot of numbers (nearly 1 billion?), here is another way to optimize for space.
Let's assume your numbers fit in 32-bit values; storing about 1 billion of them would require something close to 4 GB of space. Now, if you can afford about 128 MB of working memory, we can do this in one pass.

  1. Imagine a 1 billion bit-vector stored as an array of 32-bit words
  2. Let it be initialized to all zeros
  3. Start running through your numbers and keep setting the correct bit position for the value of the number
  4. When you are done with one pass, start counting from the start of this bit vector for the Nth set-bit
  5. That bit's position gives you the value for your Nth largest number
  6. You have actually sorted all the numbers in the process (however, count of duplicates is not tracked)
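A sketch of the steps above in Python, under the assumptions the comments below this answer point out (distinct, non-negative integers below a known bound; the function name and the bound parameter are illustrative):

```python
def nth_largest_bitvector(numbers, n, max_value):
    """Assumes distinct, non-negative integers below max_value.

    Memory is one bit per possible value (~max_value / 8 bytes),
    independent of how many numbers are generated.
    """
    bits = bytearray((max_value + 7) // 8)  # bit vector, all zeros
    for x in numbers:
        bits[x >> 3] |= 1 << (x & 7)        # set the bit for value x
    seen = 0
    for v in range(max_value - 1, -1, -1):  # scan from the top down
        if bits[v >> 3] & (1 << (v & 7)):
            seen += 1
            if seen == n:
                return v
    return None  # fewer than n distinct values were generated

print(nth_largest_bitvector([5, 1, 9, 3, 7], 2, 16))  # → 7
```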
nik
I'm using doubles, as I have decimal values. And I also have duplicates.
snake
I don't think I understand this. It fails if any of the numbers are duplicates of each other, right?
Nosredna
Right, nik's idea would work if I had distinct 32-bit values. If I understood correctly, it's like having a list of every number possible and checking off the ones that we find. Then we go through and find the Nth largest or smallest number by starting at one end and counting.
snake
Yes, you got it right; the list-of-every-number-possible still takes less space than sorting your list would require. However, there appear to be two problems. [1] Your numbers are doubles, which changes the complexity. [2] As I noted earlier, the method is transparent to duplicates: each distinct value is counted just once (unless some additional intelligence is added to the scheme). But do you count duplicates when looking for the Nth largest 'value'?
nik
+3  A: 

Are you able to regenerate the same group of numbers from start? If you are, you could make multiple passes over the output: start by finding the largest value, restart the generator, find the largest number smaller than that, restart the generator, and repeat this until you have your result.

It's going to be a real performance killer, because you have a lot of numbers and a lot of passes will be required - but memory-wise, you will only need to store 2 elements (the current maximum and a "limit", the number you found during the last pass) and a pass counter.

You could speed it up by using your priority queue to find the M largest elements (choosing some M that you are able to fit in memory), allowing you to reduce the number of passes required to N/M.

If you need to find, say, the 10th largest element in a list of 15 numbers, you could save time by working the other way around. Since it is the 10th largest element, that means there are 15-10=5 elements smaller than this element - so you could look for the 6th smallest element instead.
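A sketch of the basic multipass idea, assuming the generator can be restarted (make_stream is a hypothetical callable that regenerates the same sequence on each call) and that values are distinct:

```python
def nth_largest_multipass(make_stream, n):
    """O(1) storage (a limit and a running maximum), at the cost of
    n full passes over the data. Assumes distinct values; duplicates
    need the extra handling discussed in the comments below.
    """
    limit = float("inf")
    for _ in range(n):
        best = None
        for x in make_stream():
            if x < limit and (best is None or x > best):
                best = x
        limit = best  # the largest value strictly below the old limit
    return limit

print(nth_largest_multipass(lambda: iter([5.0, 1.5, 9.2, 3.3, 7.7]), 3))  # → 5.0
```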

Michael Madsen
Thanks. I'll see if I can get that to work within reasonable time constraints.
snake
Note that if your set contains duplicates, this makes the problem a bit harder, because you haven't necessarily filtered out all occurrences of your current limit. In that case, you'll need to find a way of handling that.
Michael Madsen
One option would be to count the occurrences of the current maximum value, then use your temporary storage to figure out how many you missed (not in your storage). You then use this to adjust your pass counter (progress "tracker") accordingly.
Michael Madsen
A good answer. Your multipass algorithm inspired my answer.
Nosredna
+3  A: 

This is a multipass algorithm (therefore, you must be able to generate the same list multiple times, or store the list off to secondary storage).

First pass:

  • Find the highest value and the lowest value. That's your initial range.

Passes after the first:

  1. Divide the range up into 10 equally spaced bins. We don't need to store any numbers in the bins; we're just going to count membership in each bin, so all we need is an array of integers (or bigints, whatever can accurately hold our counts). Note that 10 is an arbitrary choice for the number of bins; your sample size and distribution will determine the best choice.
  2. Spin through each number in the data, incrementing the count of whichever bin holds the number you see.
  3. Figure out which bin holds your answer, and add the counts of all bins above it to your running count of numbers above the winning bin.
  4. The winning bin's top and bottom range are your new range.
  5. Loop through these steps again until you have enough memory to hold the numbers in the current bin.

Last pass:

  • You should know how many numbers are above the current bin by now.
  • You have enough storage to grab all the numbers within your range of the current bin, so you can spin through and grab the actual numbers. Just sort them and grab the correct number.

Example: if the range you see is 0.0 through 1000.0, your bins' ranges will be:

 [0.0 - 100.0]
 (100.0 - 200.0]
 (200.0 - 300.0]
 ...
 (900.0 - 1000.0]

If you find through the counts that your number is in the (100.0 - 200.0] bin, your next set of bins will be:

 (100.0 - 110.0]
 (110.0 - 120.0]
 etc.
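A sketch of these passes in Python, under a few assumptions: make_stream is a hypothetical callable that regenerates the same sequence on each call, capacity stands in for "whatever fits in memory" on the last pass, and bin edges are treated as half-open intervals so duplicate counting stays consistent:

```python
import math

def nth_largest_binned(make_stream, n, num_bins=10, capacity=64):
    """Memory: num_bins counters, plus at most `capacity` numbers
    held for the final sort. Duplicates are simply counted.
    """
    # Range-finding passes (the answer above does this in one pass).
    lo = min(make_stream())
    hi = math.nextafter(max(make_stream()), math.inf)  # range [lo, hi)
    above = 0  # numbers known to lie above the current range
    while True:
        width = (hi - lo) / num_bins
        if width == 0:
            return lo  # all remaining candidates are equal
        counts = [0] * num_bins
        for x in make_stream():
            if lo <= x < hi:
                counts[min(int((x - lo) / width), num_bins - 1)] += 1
        # Find the bin holding the answer, scanning from the top bin down.
        for b in range(num_bins - 1, -1, -1):
            if above + counts[b] >= n:
                break
            above += counts[b]
        new_lo = lo + b * width
        new_hi = hi if b == num_bins - 1 else lo + (b + 1) * width
        lo, hi = new_lo, new_hi  # the winning bin is the new range
        if counts[b] <= capacity:  # last pass: sort the winning bin
            members = sorted((x for x in make_stream() if lo <= x < hi),
                             reverse=True)
            return members[n - above - 1]
```

Values exactly on a bin edge land in the upper bin, matching the half-open ranges in the example above (apart from the closed bottom edge).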


Another multipass idea:

Simply do a binary search. Choose the midpoint of the range as the first guess. Your passes just need to do an above/below count to determine the next estimate (which can be weighted by the count, or a simple average for code simplicity).
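A sketch of the binary-search variant, with the same hypothetical restartable make_stream; the absolute tolerance is an assumption you would tune to your precision needs:

```python
def nth_largest_bisect(make_stream, n, tol=1e-9):
    """Each pass stores only one count: how many numbers exceed the
    midpoint. Duplicates are handled naturally, since they are counted.
    The absolute tolerance `tol` is illustrative; values of large
    magnitude would call for a relative tolerance instead.
    """
    lo = min(make_stream())
    hi = max(make_stream())
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if sum(1 for x in make_stream() if x > mid) >= n:
            lo = mid  # at least n values above mid: answer is above it
        else:
            hi = mid  # fewer than n values above mid: answer is <= mid
    return hi
```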

Nosredna
I have a feeling this will reduce the number of passes needed, and it also saves the time spent putting values in a priority queue, because you only use the queue for the final pass. Nice improvement.
Michael Madsen
I think he can get to the answer very quickly by having 100,000 bins. 10 was just an example.
Nosredna
Great, I'll try this. I think I'll divide the number of numbers by 10 to arrive at the number of bins, so afterward, I'll have only a maximum of 10 numbers to store and sort. Then I can use linear selection.
snake