views:

1408

answers:

8

Hey. I have a very large array and I want to find the Nth largest value. Trivially I can sort the array and then take the Nth element but I'm only interested in one element so there's probably a better way than sorting the entire array...

+1  A: 

Use heapsort. It only partially orders the list until you draw the elements out.

UncleO
Try to find the n/2-th element - Requires O(nlogn)!
Dario
+3  A: 

You can iterate the entire sequence maintaining a list of the 5 largest values you find (this will be O(n)). That being said I think it would just be simpler to sort the list.

Andrew Hare
But when it isn't the fifth but the nth element, you'll have O(n²) which is even worse than sorting.
Dario
I suppose you mean to maintain a list of the N largest values. But N cannot be too large in that case.
Bruno Ranschaert
+1  A: 

You essentially want to produce a "top-N" list and select the one at the end of that list.

So you can scan the array once and insert into an empty list when the largeArray item is greater than the last item of your top-N list, then drop the last item.

After you finish scanning, pick the last item in your top-N list.

An example for ints and N = 5:

int[] top5 = new int[5]();
top5[0] = top5[1] = top5[2] = top5[3] = top5[4] = 0x80000000; // or your min value

for(int i = 0; i < largeArray.length; i++) {
    if(largeArray[i] > top5[4]) {
       // insert into top5:
       top5[4] = largeArray[i];

       // resort:
       quickSort(top5);
    }
}
Jeff Meatball Yang
+8  A: 

Sorting would require O(nlogn) runtime at minimum - There are very efficient selection algorithms which can solve your problem in linear time.

Partition-based selection (sometimes Quick select), which is based on the idea of quicksort (recursive partitioning), is a good solution (see link for pseudocode + Another example).

Dario
Nice link. I believe this is the best.
Jeff Meatball Yang
+1  A: 

As people have said, you can walk the list once keeping track of K largest values. If K is large this algorithm will be close to O(n2).

However, you can store your Kth largest values as a binary tree and the operation becomes O(n log k).

According to Wikipedia, this is the best selection algorithm:

 function findFirstK(list, left, right, k)
     if right > left
         select pivotIndex between left and right
         pivotNewIndex := partition(list, left, right, pivotIndex)
         if pivotNewIndex > k  // new condition
             findFirstK(list, left, pivotNewIndex-1, k)
         if pivotNewIndex < k
             findFirstK(list, pivotNewIndex+1, right, k)

Its complexity is O(n)

Unknown
I believe the Tournament Algorithm, see Dario's links, is what you are shooting for. It has an operation of O(n + k*log(n)).
tgray
My mistake, though I would be interested to see a full implementation of this in Python.
tgray
+3  A: 

A simple modified quicksort works very well in practice. It has average running time proportional to N (though worst case bad luck running time is O(N^2)).

Proceed like a quicksort. Pick a pivot value randomly, then stream through your values and see if they are above or below that pivot value and put them into two bins based on that comparison. In quicksort you'd then recursively sort each of those two bins. But for the N-th highest value computation, you only need to sort ONE of the bins.. the population of each bin tells you which bin holds your n-th highest value. So for example if you want the 125th highest value, and you sort into two bins which have 75 in the "high" bin and 150 in the "low" bin, you can ignore the high bin and just proceed to finding the 125-75=50th highest value in the low bin alone.

SPWorley
+4  A: 

A heap is the best data structure for this operation and Python has an excellent built-in library to do just this, called heapq.

import heapq

def nth_largest(n, iter):
    return heapq.nlargest(n, iter)[-1]

Example Usage:

>>> import random
>>> iter = [random.randint(0,1000) for i in range(100)]
>>> n = 10
>>> nth_largest(n, iter)
920

Confirm result by sorting:

>>> list(sorted(iter))[-10]
920
FogleBird
A: 

You could try the Median of Medians method - it's speed is O(N).