views:

169

answers:

8

My code processes a huge number of values and I'm looking for an efficient structure to keep track of the top (N) values, where N is less than 10, so collecting ALL numbers then sorting the list and taking the first (N) is probably not the most efficient way.

To do that, I'm building a collection of fixed size N, to keep the top (N) values sorted in descending order. The Add(T value) method of the sorted collection would add the value to the collection if value is higher than any of the existing values (in which case the last element is removed) or if the collection is not full.

I was able to implement what I wanted using a doubly LinkedList<T> since it has fast insertion and removal, but I was wondering if using SortedDictionary<TKey, TValue> or a priority queue would be better ?

Thank you.

+6  A: 

I would simply use a heap with a limited depth. I do not know whether there already exists a library for that, but it should be easy to implement.

Svante
if not, you could also sort the heap in the opposite order and remove the smallest item whenever the size is greater than N after an insert.
mihi
+1  A: 

Edit:

If only the first N values are needed and the others are not of any interest, a plain old array will get the work done cheaply.

Keep it sorted and test against the biggest. And only if it needs to be stored, insert it correctly and shift the remaining elements. With small sizes this is a cheap operation, and my guess is it won't be done often.

Eiko
+1  A: 

If you have a fix size of 10, why not simply use a sorted array of length 10 and binary search? But I am not sure if at this size, binary search is not a huge win over a dumb search along the array due to some overhead.

Frank
In this case, the search would be fast, but inserting an element is expensive because it requires copying the array into a new one.
alhazen
No, as I understood the size should always be limited to 10 items, you would shift off the last one. You only would have to move all items after the position where you insert one further. And you completely avoid the object instantiation of a linked list insertion.
Frank
+4  A: 

The main advantage to use a SortedDictionary or SortedList it is that you can skip the sorting intelligence because they handle it for you( e.g. You just have to remove the (n + 1)th element every time you add a value). But on the other hands adopt that sort of complex structure for 10 elements resembles to use a nuke to kill a fly...

Maybe the linked list is a good way, and also a simple linear comparison for inserting values in order is not so slower than binary search (we still speak about max 10 comparisons against ~3, current CPUs not event feel the difference).

EDIT:

fixed arrays can be used to build prioriry queues with binary heaps, that probably is the right way to implement this

digEmAll
Linked list is suboptimal because insertion is O(N) operation.
olegz
To be precise, insertion is O(1) because doesn't require shifting; to find where to insert requires O(n) (at most). I mentioned it when I said "linear comparison". Anyway I've just edited it, I've changed "best way" with "good way" ;)
digEmAll
Linked list uses one chunk of memory per item, and one new chunk need to be reserved for each new item; this is somewhat more expensive than an array, that can be reserved once for all (since its maximal size is known). While garbage collection works well, it is probably even better to avoid it.
@fred-hh: yes you're right. In fact probably the best way is a fixed array as a binary heap. Anyway,I think that with at most 10 elements we'll never notice the difference...
digEmAll
+2  A: 

For such a small number, just keep an array. Scan the array keeping track of the smallest value and its position. If your new number is larger than the smallest on in the set, replace it. You should of course scan for the lowest value once after you insert a number, then just compare new numbers to that and only take action if you have something larger (replace and rescan).

phkahler
Although the set is quite small (10 elements only) but this sounds very inefficient to me. What if the new element is bigger than the biggest one in the list? You will replace the smallest with it, and the list becomes unsorted. Now you have to use a linear search through all the elements to find the smallest in the next insertions :-s
instcode
Right, the list is UNsorted. You only scan it after you swap the new value. You store the lowest value and it's position. Checking subsequent values is just a comparison with that stored lowest one. If you've got a larger one, then swap and rescan.
phkahler
+2  A: 

Unless you have a solid reason to do otherwise, I'd use a priority queue.

There is one trick that can simplify the logic quite a bit. Most people's first idea is to look at each incoming item, and insert it into the collection iff the collection contains fewer items than desired, or the new item is larger than the smallest item currently in the collection.

You can simplify things quite a bit if you leave room for one extra item in the collection. Always insert each incoming item into the collection, and then if the collection is too large, remove the smallest item.

While a priority queue is arguably overkill for only 10 items, it keeps the logic simple, and is efficient both in terms of space and time, so if you ever need N=10000 (or whatever) it'll still work nicely.

Jerry Coffin
I would avoid "always insert each incoming item" if speed is really important, because insertion is not (that) cheap (O(log N)) and, anyway, after the N first incoming items, the necessary comparison with the top item in the priority queue (= worst of the N best) suffices.
+2  A: 

The performance may really change.

For N < 10 any overly complex data structure will likely drag performance significantly (though perhaps not catastrophically) so I'd use an array to store the items.

Then there are 3 main possibilities on how to arrange the items in the array:

  1. sorted is probably the best choice to keep things simple:
    • constant time to determine whether to insert a new item (compare with lowest)
    • O(N) time to insert - but this only happens for items that are in the N best-so-far. And if your input is sufficiently random, the average time will be even lower because most insertions will only move a few elements at the bottom of the top.
  2. unsorted:
    • O(N) time for each input element, that's too much compared to "sorted"
  3. binary heap that implements a priority queue: more complex to implement but maybe even faster than "sorted"
    • constant time to determine whether to insert a new item (compare with lowest)
    • O(log N) time to insert - and this only happens for items that are in the N best-so-far
A: 

Use binary insertion sort on a raw array, pushing the smallest value off the end. This is routinely the fastest method used to maintain small sorted arrays and, for example, is generally used as a special case for various sorting algorithms (e.g. MergeSort).

Rex Kerr