views: 801
answers: 4

What is a good way to get the top 10 records from a very large collection and use a custom OrderBy? If I use the LINQ to Objects OrderBy method it is slow and takes a lot of memory because it creates an entire new collection with the new order. I would like a new method with the signature below that does not re-order the entire collection and is very fast:

public static IEnumerable<TSource> OrderByTop<TSource, TKey>(
    IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer,
    int topCount)

I tried to write it, but it got very complicated, and I thought there might be an easier way using Aggregate or something. Any help would be appreciated.

Answer

Thanks for the help. I ended up with the code below:

public static List<TSource> OrderByTop<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer,
    int topCount)
{
    var itemComparer = keySelector.ToIComparer(comparer);
    return source.Aggregate(
        new List<TSource>(topCount),
        (List<TSource> list, TSource item) =>
            list.SortedInsert(item, itemComparer, topCount));
}

The List extension method SortedInsert follows:

public static List<T> SortedInsert<T>(
    this List<T> list,
    T item,
    IComparer<T> comparer,
    int maxLength)
{
    if (list.Count == maxLength)
        if (comparer.Compare(item, list[maxLength - 1]) >= 0)
            return list;
        else
            list.RemoveAt(maxLength - 1);
    int insertIndex = list.BinarySearch(item, comparer);
    if (insertIndex < 0)
        insertIndex = ~insertIndex;
    list.Insert(insertIndex, item);
    return list;
}

For those interested, I also wrote a keySelector extension method to convert it to an IComparer.

public static IComparer<TSource> ToIComparer<TSource, TKey>(
    this Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)
{
    return new KeySelectorToIComparerConverter<TSource, TKey>(
        keySelector,
        comparer);
}
private class KeySelectorToIComparerConverter<TSource, TKey>
    : IComparer<TSource>
{
    private readonly IComparer<TKey> comparer;
    private readonly Func<TSource, TKey> keySelector;
    public KeySelectorToIComparerConverter(
        Func<TSource, TKey> keySelector,
        IComparer<TKey> comparer)
    {
        this.comparer = comparer;
        this.keySelector = keySelector;
    }
    public int Compare(TSource x, TSource y)
    {
        return comparer.Compare(keySelector(x), keySelector(y));
    }
}
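A usage sketch of the extensions above (assuming they are declared in an accessible static class):

```csharp
// Illustrative usage only; names match the extension methods above.
int[] numbers = { 5, 1, 9, 3, 7, 2, 8, 6, 4, 0 };

// Take the 3 smallest values without sorting the whole sequence.
List<int> top3 = numbers.OrderByTop(
    n => n,                    // key selector
    Comparer<int>.Default,     // ascending order
    3);
// top3 now holds 0, 1, 2 in sorted order
```

Only a list of topCount elements is ever materialized, so memory stays O(k) regardless of the source size.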
+3  A: 

I think what you want is really a selection algorithm. I don't know that LINQ is the best way to implement one since I think it basically ends up as selection by sorting. You ought to be able to do this in O(kN), where k is the "top" number of items by iterating through the collection, keeping track of the minimum "top" element seen so far and if the current element is bigger than that, replacing that element with the current element (and updating the new minimum element). This is space efficient as well.

When you are done you can return the "top" elements as an ordered collection.

Note: I'm assuming LINQ to Objects here. If you are using LINQ to SQL, then I'd simply defer the ordering/selection to the SQL server and chain the methods appropriately to get a select top N ... from ... order by ... query.

Completely untested, not even compiled. Uses a generic Fibonacci Heap implementation. I'll post the code on my blog (http://farm-fresh-code.blogspot.com) sometime soon. I've got one hanging around (not sure if it's generic) as a result of some experiments with priority queues that I was doing. See wikipedia for info and pseudocode until then.

public static IEnumerable<TSource> OrderByTop<TSource, TKey>(
    IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer,
    int topCount)
{
    // the heap holds at most topCount candidates (+1 briefly while a new one is added)
    FibonacciHeap<TKey, TSource> top = new FibonacciHeap<TKey, TSource>( comparer );
    foreach (var candidate in source) // O(n)
    {
         TKey key = keySelector(candidate);
         var minimum = top.AccessMinimum(); // heap node, or null when empty
         if (top.Count < topCount || comparer.Compare( key, minimum.Key ) > 0) // O(1)
         {
             top.Insert( key, candidate ); // O(1)
             if (top.Count > topCount)
             {
                 top.DeleteMinimum(); // O(logk)
             }
         }
    }
    return top.ToList().AsEnumerable().Reverse().Select( t => t.Value ); // O(k)
}
tvanfosson
Thanks for the link. That is the type of algorithm I want. I was hoping something like that has already been written in C# and I would not have to write it myself. This seems like a common problem that should have a good solution out there already.
DRBlaise
Thanks for code but I went with MartinStettner's version because his handles duplicates and keeps the list sorted throughout.
DRBlaise
I can't really think of any easy way to extend for duplicate keys without either making more complex, more costly, or changing to use a sorted heap -- or using the same BinarySearch trick. I have a Fibonacci Heap implementation that is O(1) min/insert and O(logn) delete but that would add a lot of code. Using it would result in O(logkN) but like I said would require the heap implementation.
tvanfosson
+1  A: 

I do not know an other solution than writing this method. However this method should not be that complicated.

You need to maintain a sorted list with the top 10 elements and iterate through the original collection once.

If the current record is smaller than the last one in the top 10 list, or if you do not have 10 records yet, you add the item to the list. (And of course, remove the last item from the top 10 list when appropriate.)
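The loop described above can be sketched as follows (an untested illustration; `source` stands for the original collection and `int` elements are assumed):

```csharp
// Keep a sorted top-10 list while scanning the source collection once.
const int topCount = 10;
List<int> top = new List<int>(topCount);
foreach (int item in source)
{
    if (top.Count < topCount || item < top[top.Count - 1])
    {
        int index = top.BinarySearch(item);
        if (index < 0) index = ~index; // complement = insertion point
        top.Insert(index, item);
        if (top.Count > topCount)
            top.RemoveAt(topCount);    // drop the element that fell out
    }
}
```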

treaschf
+4  A: 

Aggregate is a good place to start with:

SortedList<TKey, TSource> resultlist = new SortedList<TKey, TSource>();
MyBigList.Aggregate(resultlist, (aktlist,entry) => {
   aktlist.Add(entry.Key, entry);
   if (aktlist.Count > 10) aktlist.RemoveAt(10);
   return aktlist;
});

If you want a different comparer, you can specify one in the constructor of the SortedList.

EDIT As mentioned by nikie, a SortedList cannot contain duplicate keys. You can use a standard List together with BinarySearch to achieve the same effect:

List<TSource> resultlist = new List<TSource>();
MyBigList.Aggregate(resultlist, (aktlist, entry) => {
   int index = aktlist.BinarySearch(entry);
   if (index < 0) index = ~index;
   if (index < 10) aktlist.Insert(index, entry);
   if (aktlist.Count > 10) aktlist.RemoveAt(10);
   return aktlist;
});

Again a custom comparer (together with a custom key selection) can be used as parameter to BinarySearch.
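For instance, a descending top 10 could be obtained by passing a custom comparer to the BinarySearch overload (sketch only; the comparer class name is made up for the example):

```csharp
// A custom comparer passed to BinarySearch yields a descending top 10.
class DescendingComparer : IComparer<int>
{
    public int Compare(int x, int y) { return y.CompareTo(x); }
}

List<int> aktlist = new List<int>();
IComparer<int> cmp = new DescendingComparer();
foreach (int entry in MyBigList) // MyBigList as in the snippet above
{
    int index = aktlist.BinarySearch(entry, cmp);
    if (index < 0) index = ~index;
    if (index < 10) aktlist.Insert(index, entry);
    if (aktlist.Count > 10) aktlist.RemoveAt(10);
}
```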

MartinStettner
IIRC SortedList throws an exception when a key already exists.
nikie
Very nice! It should be RemoveAt(10) though and like nikie said it does not accept duplicate keys.
DRBlaise
Thanks for your hints, I've edited the answer to reflect both of them ...
MartinStettner
Wow, I didn't know that BinarySearch gives you the bitwise complement of the larger element. I am giving you the answer!
DRBlaise
Actually you can save a lot of time, if you add a condition for the insertion (index < 10). I changed this in the post.
MartinStettner
+1  A: 

You could also implement a divide-and-conquer sorting algorithm like quicksort and break as soon as you have the first k sorted elements. But tvanfosson's suggestion is probably faster if k << N
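A sketch of such a partial quicksort (untested illustration; it recurses into the right partition only while the first k positions are not yet settled):

```csharp
// Partial quicksort: sorts only enough of the array that the first k
// elements end up in their final sorted positions.
static void PartialSort<T>(T[] a, int left, int right, int k, IComparer<T> cmp)
{
    if (left >= right) return;

    // Lomuto partition around the last element.
    T pivot = a[right];
    int store = left;
    for (int i = left; i < right; i++)
    {
        if (cmp.Compare(a[i], pivot) < 0)
        {
            T tmp = a[i]; a[i] = a[store]; a[store] = tmp;
            store++;
        }
    }
    T t = a[store]; a[store] = a[right]; a[right] = t;

    PartialSort(a, left, store - 1, k, cmp);
    if (store < k - 1) // right side matters only if the first k aren't settled
        PartialSort(a, store + 1, right, k, cmp);
}
```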

nikie