Hello everyone,

I'm trying to obtain the top, say, 100 scores from a list of scores being generated by my program. Unfortunately, the list is huge (on the order of millions to billions of entries), so sorting is a time-intensive portion of the program.

What's the best way of doing the sorting to get the top 100 scores?

The only two methods I can think of so far are: first, generating all the scores into a massive array, sorting it, and taking the top 100; or second, generating X scores at a time, sorting them, truncating to the top 100, then continuing to generate more scores, adding them to the truncated list, and sorting again.

Either way I do it, it still takes more time than I would like. Any ideas on how to do it in an even more efficient way? (I've never taken programming courses before; maybe those of you with comp sci degrees know about efficient algorithms to do this, at least that's what I'm hoping.)

Lastly, what's the sorting algorithm used by the standard sort() function in C++?

Thanks,

-Faken

Edit: Just for anyone who is curious...

I did a few time trials on the before and after and here are the results:

Old program (performs sorting after each outer loop iteration):

top 100 scores: 147 seconds
top  10 scores: 147 seconds
top   1 scores: 146 seconds
Sorting disabled: 55 seconds

New program (tracking only the top scores, using the default sorting function):

top 100 scores: 350 seconds <-- hmm...worse than before
top  10 scores: 103 seconds 
top   1 scores:  69 seconds 
Sorting disabled: 51 seconds

New rewrite (optimized data storage, hand-written sorting algorithm):

top 100 scores: 71 seconds <-- Very nice!
top  10 scores: 52 seconds
top   1 scores: 51 seconds
Sorting disabled: 50 seconds

Done on a Core 2 at 1.6 GHz...I can't wait till my Core i7 860 arrives...

There are a lot of other, even more aggressive optimizations for me to work out (mainly in the area of reducing the number of iterations I run), but as it stands right now the speed is more than good enough; I might not even bother to work out those algorithmic optimizations.

Thanks to everyone for their input!

A: 

You want the absolute largest X numbers, so I'm guessing you don't want some sort of heuristic. How unsorted is the list? If it's pretty random, your best bet really is just to do a quick sort on the whole list and grab the top X results.

If you can filter scores during list generation, that's far better. Only ever store X values; every time you get a new value, compare it to those X values. If it's smaller than all of them, throw it out. If it's bigger than one of them, keep it and throw out what is then the smallest of the stored values.

If X is small enough, you can even keep your list of X values sorted, so that you are comparing each new number against a sorted list: an O(1) check tells you whether the new value is smaller than all of them and can thus be thrown out. Otherwise, a quick binary search finds where the new value goes in the list, and you can throw away the first value of the array (assuming the first element is the smallest).
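
For illustration, here's a minimal C++ sketch of that scheme, assuming plain int scores kept in a sorted std::vector (the helper name consider is mine):

#include <algorithm>
#include <vector>

// Keep the best `x` values seen so far in `best`, sorted ascending.
void consider(std::vector<int>& best, std::size_t x, int value) {
    if (best.size() == x && value <= best.front())
        return;  // O(1) rejection: smaller than everything kept so far
    // binary search for the insertion point
    best.insert(std::lower_bound(best.begin(), best.end(), value), value);
    if (best.size() > x)
        best.erase(best.begin());  // throw away the smallest
}

(Note that the vector insert still shifts up to x elements, so the binary search saves comparisons, not moves.)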

AlbertoPL
Given that you need to look at every element in the list, wouldn't it be faster just to iterate through the list, maintaining an array of the largest 100 so far and a pointer to the smallest of the 100, swapping out when you encounter a larger number?
Steve Homer
Yes, and that will require that the list of 100 also stays sorted.
AlbertoPL
+21  A: 
  1. take the first 100 scores, and sort them in an array.
  2. take the next score, and insertion-sort it into the array (starting at the "small" end)
  3. drop the 101st value
  4. continue with the next value, at 2, until done

Over time, the list will resemble the 100 largest values more and more, so more and more often you'll find that the insertion sort aborts immediately, the new value being smaller than the smallest of the candidates for the top 100.
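
A minimal C++ sketch of these steps, assuming int scores (the function name add_score is illustrative):

#include <vector>

// `top` holds at most 100 scores, smallest first.
void add_score(std::vector<int>& top, int score) {
    if (top.size() == 100) {
        if (score <= top.front())
            return;              // the common case over time: abort immediately
        top.erase(top.begin());  // drop the smallest, i.e. the "101st value"
    }
    // insertion-sort the new score, starting at the "small" end
    auto it = top.begin();
    while (it != top.end() && *it < score)
        ++it;
    top.insert(it, score);
}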

Martin v. Löwis
+1 for noting that there is no need to keep track of anything more than the top 100 elements. Wish I could give extra points for suggesting insertion sort as well.
Will Bickford
Nice, I love the beauty of that, simplistic and efficient!
Faken
The degenerate case is when your original list is in reverse sorted order. That will take 100 times longer than the average case, but will still be O(n).
Mark Ransom
Actually, you could find the insertion point at j in O(logn) time and move elements at j+1 to count-1 down a position, thereby dropping the smallest at count-1. And you could do this only if the new element is greater than the element at count-1. But if you are going to do that, you may as well use a heap, as Jack Lloyd is recommending.
hughdbrown
wow, great solution!
Justin Grant
Using a fixed pool of linked-list buckets here could possibly help wall time (not algorithmic performance); but this is going micro and requires benchmarking. Good answer and comments.
pst
@hughdbrown But at 100 elements it might not reach the C in the O to make it worth it.
pst
@pst: sorry. No decoder ring. Don't know what your C and O are. Assuming you are commenting on the binary search, it will be better than an insertion sort. If you are commenting on moving the elements down, then you can get all LINQ-y and do this: `top100 = top100.Take(j).Concat(new int[] {newElement}).Concat(top100.Skip(j-1).Take(size-j-1)).ToArray();` Of course, as you can see by my comments above and my answer below, I prefer using a heap/priority queue to inserting into sorted arrays.
hughdbrown
@hughdbrown: I question whether binary search will do significantly better. If you *always* do a binary search (not special-casing when the new element is too small), you will typically do 7 comparisons per new element, whereas the insertion sort will do 1 comparison. Plus, when it does insert a new element, you have to move k elements by one index in either case. If the scores are ints (say), then moving them may be actually more expensive than comparing them.
Martin v. Löwis
@Martin v. Lowis: "if you always do a binary search..." But that's not what I was suggesting. I said: "And you could do this only if the new element is greater than the element at count-1." That's exactly the case you are calling out. And if both methods have to move the same number of ints the same distance, then why is it better to do insertion sort when "moving them may actually be more expensive than comparing them"? The binary search likely does 7 comparisons vs 50 for insertion sort and the number of ints moved is the same. But, Jack Lloyd's solution is algorithmically better.
hughdbrown
@hughdbrown: I didn't say insertion sort is better - I only claimed that the binary search won't be significantly better (but perhaps slightly). I also believe that there will be much less than 50 comparisons for the insertion sort on average, because new scores will likely be small compared to the scores you already have.
Martin v. Löwis
++ Nice job, mainly your observation that most of the time the insertion sort will stop immediately. As far as insertion speed goes, binary search might be faster than linear, but they are both O(1) because N is bounded by 100.
Mike Dunlavey
use std::partial_sort, Luk!
f0b0s
A: 

Place the data into a balanced tree structure (probably a red-black tree) that keeps it sorted as you insert. Insertions are O(log n), and grabbing the highest x scores is O(x log n).

You can prune the tree every once in a while if you find you need optimizations at some point.
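
In C++, for example, std::multiset (typically implemented as a red-black tree) can serve as this structure; here's a rough sketch that prunes to 100 elements as it goes (the function name top100 is mine):

#include <set>
#include <vector>

std::vector<int> top100(const std::vector<int>& scores) {
    std::multiset<int> top;  // balanced tree, kept sorted; *top.begin() is the smallest
    for (int s : scores) {
        if (top.size() < 100) {
            top.insert(s);
        } else if (s > *top.begin()) {
            top.erase(top.begin());  // prune the smallest
            top.insert(s);
        }
    }
    return std::vector<int>(top.begin(), top.end());  // ascending order
}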

Rob Spieldenner
I did mention that I haven't taken programming courses; sorry, you went way over my head...
Faken
If you have some kind of library that lets you sort an array or List, the library also probably has something like a TreeMap that'll do the trick.
Rob Spieldenner
A: 

If you only need to report the values of the top 100 scores (and not any associated data), and if you know that the scores will all be in a finite range such as [0,100], then an easy way to do it is with a "counting sort"...

Basically, create an array representing all possible values (e.g. an array of size 101 if scores can range from 0 to 100 inclusive), and initialize all of its elements to 0. Then iterate through the list of scores, incrementing the corresponding entry; that is, tally the number of times each score in the range has been achieved. Finally, working from the end of the array back to the beginning, you can pick out the top X scores. Here is some pseudo-code:

    let type Score be an integer ranging from 0 to 100, inclusive.
    let scores be an array of Score objects
    let scorerange be an array of integers of size 101.

    for i in [0,100]
        set scorerange[i] = 0

    for each score in scores
        set scorerange[score] = scorerange[score] + 1

    let top be the number of top scores to report
    let idx be an integer initialized to the end of scorerange (i.e. 100)

    while (top > 0) and (idx>=0):
        if scorerange[idx] > 0:
              report "There are " scorerange[idx] " scores with value " idx
              top =  top - scorerange[idx]
        idx = idx - 1;
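
A direct C++ translation of that pseudo-code, assuming int scores in the range [0,100], might look like this:

#include <cstdio>
#include <vector>

void report_top(const std::vector<int>& scores, int top) {
    int scorerange[101] = {0};  // one counter per possible score, 0..100
    for (int score : scores)
        ++scorerange[score];
    for (int idx = 100; top > 0 && idx >= 0; --idx) {
        if (scorerange[idx] > 0) {
            std::printf("There are %d scores with value %d\n", scorerange[idx], idx);
            top -= scorerange[idx];
        }
    }
}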
Michael Aaron Safyan
+2  A: 

Declare an array where you can put the 100 best scores. Loop through the huge list and check for each item whether it qualifies to be inserted into the top 100. Use a simple insertion sort to add an item to the top list.

Something like this (C# code, but you get the idea):

Score[] toplist = new Score[100];   // kept sorted, largest score first
int size = 0;
foreach (Score score in hugeList) {
   int pos = size;
   // scan from the small end, shifting smaller entries one slot toward the tail
   while (pos > 0 && toplist[pos - 1] < score) {
      pos--;
      if (pos < 99) toplist[pos + 1] = toplist[pos];   // the old slot 99 simply falls off
   }
   if (size < 100) size++;
   if (pos < size) toplist[pos] = score;   // pos == size means the score didn't qualify
}

I tested it on my computer (Core 2 Duo, 2.54 GHz, Win 7 x64) and I can process 100,000,000 items in 369 ms.

Guffa
Hmm, so generate the entire score array first before I do the insertion sort...I guess I'll have to look at which method would produce the most cache hits before implementing it. Thanks.
Faken
@Faken: I don't know if it has anything to do with cache hits, but apparently this code is 700 times faster than Jack Lloyd's code using a heap...
Guffa
+6  A: 

You can do this in O(n) time, without any sorting, using a heap:

#!/usr/bin/python

import heapq

def top_n(l, n):
    top_n = []

    smallest = None

    for elem in l:
        if len(top_n) < n:
            top_n.append(elem)
            if len(top_n) == n:
                heapq.heapify(top_n)
                smallest = heapq.nsmallest(1, top_n)[0]
        else:
            if elem > smallest:
                heapq.heapreplace(top_n, elem)
                smallest = heapq.nsmallest(1, top_n)[0]

    return sorted(top_n)


def random_ints(n):
    import random
    for i in range(0, n):
        yield random.randint(0, 10000)

print top_n(random_ints(1000000), 100)

Times on my machine (Core2 Q6600, Linux, Python 2.6, measured with bash time builtin):

  • 100000 elements: .29 seconds
  • 1000000 elements: 2.8 seconds
  • 10000000 elements: 25.2 seconds

Edit/addition: In C++, you can use std::priority_queue in much the same way as Python's heapq module is used here. You'll want to use the std::greater ordering instead of the default std::less, so that the top() member function returns the smallest element instead of the largest one. C++'s priority queue doesn't have an equivalent of heapreplace, which replaces the top element with a new one, so instead you'll want to pop the top (smallest) element and then push the newly seen value. Other than that, the algorithm translates quite cleanly from Python to C++.
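
For example, a sketch along those lines, assuming int scores (the name top_n mirrors the Python above):

#include <functional>
#include <queue>
#include <vector>

std::vector<int> top_n(const std::vector<int>& scores, std::size_t n) {
    // std::greater makes this a min-heap, so top() is the smallest kept score
    std::priority_queue<int, std::vector<int>, std::greater<int>> heap;
    for (int s : scores) {
        if (heap.size() < n) {
            heap.push(s);
        } else if (s > heap.top()) {
            heap.pop();    // no heapreplace in C++: pop the smallest...
            heap.push(s);  // ...then push the newly seen value
        }
    }
    std::vector<int> result;
    while (!heap.empty()) { result.push_back(heap.top()); heap.pop(); }
    return result;  // ascending order, like sorted(top_n) in the Python
}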

Jack Lloyd
@strager For any constant X, say 100, the heap operations can be treated as being constant time, since they are log(X) or X*log(X); with X constant, these are treated asymptotically as O(1). And this is not a sorting method, really, unless you set X = N, in which case, of course, X is not a constant.
Jack Lloyd
@Lloyd, Yeah, I realized that. =X
strager
+1  A: 

You can do it in Haskell like this:

largest100 xs = take 100 $ sortBy (flip compare) xs

This looks like it sorts all the numbers into descending order (the "flip compare" bit reverses the arguments to the standard comparison function) and then returns the first 100 entries from the list. But Haskell is lazily evaluated, so the sortBy function does just enough sorting to find the first 100 numbers in the list, and then stops.

Purists will note that you could also write the function as

largest100 = take 100 . sortBy (flip compare)

This means just the same thing, but illustrates the Haskell style of composing a new function out of the building blocks of other functions rather than handing variables around the place.

Paul Johnson
A: 

I got this as an interview question in 2008, and answered it by implementing a templatized priority queue in C#.

using System;
using System.Collections.Generic;
using System.Text;

namespace CompanyTest
{
    // Based on pre-generics C# implementation at
    //  http://www.boyet.com/Articles/WritingapriorityqueueinC.html
    // and wikipedia article
    //  http://en.wikipedia.org/wiki/Binary_heap
    class PriorityQueue<T>
    {
     struct Pair
     {
      T val;
      int priority;
      public Pair(T v, int p)
      {
       this.val = v;
       this.priority = p;
      }
      public T Val { get { return this.val; } }
      public int Priority { get { return this.priority; } }
     }
     #region Private members
     private System.Collections.Generic.List<Pair> array = new System.Collections.Generic.List<Pair>();
     #endregion
     #region Constructor
     public PriorityQueue()
     {
     }
     #endregion
     #region Public methods
     public void Enqueue(T val, int priority)
     {
      Pair p = new Pair(val, priority);
      array.Add(p);
      bubbleUp(array.Count - 1);
     }
     public T Dequeue()
     {
      if (array.Count <= 0)
       throw new System.InvalidOperationException("Queue is empty");
      else
      {
       Pair result = array[0];
       array[0] = array[array.Count - 1];
       array.RemoveAt(array.Count - 1);
       if (array.Count > 0)
        trickleDown(0);
       return result.Val;
      }
     }
     #endregion
     #region Private methods
     private static int ParentOf(int index)
     {
      return (index - 1) / 2;
     }
     private static int LeftChildOf(int index)
     {
      return (index * 2) + 1;
     }
     private static bool ParentIsLowerPriority(Pair parent, Pair item)
     {
      return (parent.Priority < item.Priority);
     }
     // Move high priority items from bottom up the heap
     private void bubbleUp(int index)
     {
      Pair item = array[index];
      int parent = ParentOf(index);
      while ((index > 0) && ParentIsLowerPriority(array[parent], item))
      {
       // Parent is lower priority -- move it down
       array[index] = array[parent];
       index = parent;
       parent = ParentOf(index);
      }
      // Write the item once in its correct place
      array[index] = item;
     }
     // Push low priority items from the top of the heap down
     private void trickleDown(int index)
     {
      Pair item = array[index];
      int child = LeftChildOf(index);
      while (child < array.Count)
      {
       bool rightChildExists = ((child + 1) < array.Count);
       if (rightChildExists)
       {
        bool rightChildIsHigherPriority = (array[child].Priority < array[child + 1].Priority);
        if (rightChildIsHigherPriority)
         child++;
       }
       // array[child] points at higher priority sibling -- move it up
       array[index] = array[child];
       index = child;
       child = LeftChildOf(index);
      }
      // Put the former root in its correct place
      array[index] = item;
      bubbleUp(index);
     }
     #endregion
    }
}
hughdbrown
+4  A: 

Here's the 'natural' C++ way to do this:

#include <algorithm>   // std::partial_sort
#include <functional>  // std::greater
#include <vector>

std::vector<Score> v;
// fill in v
std::partial_sort(v.begin(), v.begin() + 100, v.end(), std::greater<Score>());
// v[0] through v[99] now hold the 100 largest scores, in descending order

This is linear in the number of scores: std::partial_sort does O(n log k) work, and k = 100 is a constant here.

The algorithm used by std::sort isn't specified by the standard, but libstdc++ (used by g++) uses introsort: essentially a median-of-3 quicksort down to a certain recursion depth, falling back to heapsort beyond that, followed by a final insertion-sort pass.

Richard Smith
Yeah, I was just about to answer the same way!
f0b0s
+2  A: 

Since speed is of the essence here, and 40,000 possible highscore values is easily manageable for any of today's computers, I'd resort to bucket sort for simplicity. My guess is that it would outperform any of the algorithms proposed thus far. The downside is that you'd have to determine some upper limit for the highscore values.

So, let's assume your max highscore value is 40,000:

Make an array of 40,000 entries. Loop through your highscore values; each time you encounter highscore x, increase array[x] by one. After this, all you have to do is count the top entries in your array, from the end, until you have reached 100 counted highscores.

Pedery
Well, a bucket sort would work for finding my top 100 scores, but it would only give me the score values themselves. I guess it was my fault; I didn't define the problem as exactly as I should have. Each score is derived from 3 values, and each score must have those 3 values tagged along with it, so a bucket sort would not suit my needs. But you're right, this method would definitely outperform other methods if the range were small and I weren't sorting classes.
Faken
Hmm...on second thought, it may work if I implemented some sort of list attached to each of the buckets for storing the other data...but that would be extremely memory intensive unless I put a cutoff point somewhere, and even then, I wouldn't be able to guess a high range without iterating over the entire data set.
Faken
But then again, I could always implement a cutoff after, say, every outer loop iteration that checks where my top 100 scores are, with an if statement to check whether the next score is within that high-score range...this might actually work out to be even more efficient! The only downside would be memory usage; the current best answer only uses a maximum of about 400 KB of memory total...but then again, with 8 GB of RAM, what's a few hundred MB? (Err, well, I guess cache would have a lot to do with this...the earlier program would sit very nicely in L2 cache entirely, though.) Either way, it is interesting...
Faken
Not really. Remember, in your array you'll only store pointers to whatever structures you have chosen to encapsulate your data. In this new scenario, each of your potential 40,000 slots holds a pointer to a list of entries, so you'll have room for 40,000 32-bit pointers instead of 40,000 32-bit integers. As for the data itself, it would have to be stored anyway, so no redundant memory is spent there. You can also implement a safety-valve function that resizes your 40,000-entry array by, say, 10,000 if there are ever values above the highest estimated value.
Pedery
A: 

Median of medians algorithm: use it to select the 100th largest element in guaranteed O(n) time, then partition the list around it.

jk