views: 801
answers: 4

What is a good way to get the top 10 records from a very large collection and use a custom OrderBy? If I use the LINQ to Objects OrderBy method it is slow and takes a lot of memory because it creates an entire new collection with the new order. I would like a new method with the signature below that does not re-order the entire collection and is very fast:

public static IEnumerable<TSource> OrderByTop<TSource, TKey>(
    IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer,
    int topCount)

I tried to write it, but it got very complicated, and I thought there might be an easier way using Aggregate or something. Any help would be appreciated.

Answer

Thanks for the help. I ended up with the code below:

public static List<TSource> OrderByTop<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer,
    int topCount)
{
    var itemComparer = keySelector.ToIComparer(comparer);
    return source.Aggregate(
        new List<TSource>(topCount),
        (List<TSource> list, TSource item) =>
            list.SortedInsert(item, itemComparer, topCount));
}

The List extension method SortedInsert follows:

public static List<T> SortedInsert<T>(
    this List<T> list,
    T item,
    IComparer<T> comparer,
    int maxLength)
{
    if (list.Count == maxLength)
        if (comparer.Compare(item, list[maxLength - 1]) >= 0)
            return list;
        else
            list.RemoveAt(maxLength - 1);
    int insertIndex = list.BinarySearch(item, comparer);
    if (insertIndex < 0)
        insertIndex = ~insertIndex;
    list.Insert(insertIndex, item);
    return list;
}

For those interested, I also wrote a keySelector extension method to convert it to an IComparer.

public static IComparer<TSource> ToIComparer<TSource, TKey>(
    this Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)
{
    return new KeySelectorToIComparerConverter<TSource, TKey>(
        keySelector,
        comparer);
}
private class KeySelectorToIComparerConverter<TSource, TKey>
    : IComparer<TSource>
{
    private readonly IComparer<TKey> comparer;
    private readonly Func<TSource, TKey> keySelector;
    public KeySelectorToIComparerConverter(
        Func<TSource, TKey> keySelector,
        IComparer<TKey> comparer)
    {
        this.comparer = comparer;
        this.keySelector = keySelector;
    }
    public int Compare(TSource x, TSource y)
    {
        return comparer.Compare(keySelector(x), keySelector(y));
    }
}
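A usage sketch of the extensions above (assuming they are declared in an accessible static class):

```csharp
// Illustrative usage only; names match the extension methods above.
int[] numbers = { 5, 1, 9, 3, 7, 2, 8, 6, 4, 0 };

// Take the 3 smallest values without sorting the whole sequence.
List<int> top3 = numbers.OrderByTop(
    n => n,                    // key selector
    Comparer<int>.Default,     // ascending order
    3);
// top3 now holds 0, 1, 2 in sorted order
```

Only a list of topCount elements is ever materialized, so memory stays O(k) regardless of the source size.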
+3  A: 

I think what you want is really a selection algorithm. I don't know that LINQ is the best way to implement one since I think it basically ends up as selection by sorting. You ought to be able to do this in O(kN), where k is the "top" number of items by iterating through the collection, keeping track of the minimum "top" element seen so far and if the current element is bigger than that, replacing that element with the current element (and updating the new minimum element). This is space efficient as well.

When you are done you can return the "top" elements as an ordered collection.

Note: I'm assuming LINQ to Objects here. If you are using LINQ to SQL, then I'd simply defer the ordering/selection to the SQL server and chain the methods appropriately to get a select top N ... from ... order by ... query.

Completely untested, not even compiled. Uses a generic Fibonacci Heap implementation. I'll post the code on my blog (http://farm-fresh-code.blogspot.com) sometime soon. I've got one hanging around (not sure if it's generic) as a result of some experiments with priority queues that I was doing. See wikipedia for info and pseudocode until then.

public static IEnumerable<TSource> OrderByTop<TSource, TKey>(
    IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer,
    int topCount)
{
    // the heap holds at most topCount candidates (+1 briefly while a new one is added)
    FibonacciHeap<TKey, TSource> top = new FibonacciHeap<TKey, TSource>( comparer );
    foreach (var candidate in source) // O(n)
    {
         TKey key = keySelector(candidate);
         var minimum = top.AccessMinimum(); // heap node, or null when empty
         if (top.Count < topCount || comparer.Compare( key, minimum.Key ) > 0) // O(1)
         {
             top.Insert( key, candidate ); // O(1)
             if (top.Count > topCount)
             {
                 top.DeleteMinimum(); // O(logk)
             }
         }
    }
    return top.ToList().AsEnumerable().Reverse().Select( t => t.Value ); // O(k)
}
tvanfosson
Thanks for the link. That is the type of algorithm I want. I was hoping something like that has already been written in C# and I would not have to write it myself. This seems like a common problem that should have a good solution out there already.
DRBlaise
Thanks for code but I went with MartinStettner's version because his handles duplicates and keeps the list sorted throughout.
DRBlaise
I can't really think of any easy way to extend for duplicate keys without either making more complex, more costly, or changing to use a sorted heap -- or using the same BinarySearch trick. I have a Fibonacci Heap implementation that is O(1) min/insert and O(logn) delete but that would add a lot of code. Using it would result in O(logkN) but like I said would require the heap implementation.
tvanfosson
+1  A: 

I do not know an other solution than writing this method. However this method should not be that complicated.

You need to maintain a sorted list with the top 10 elements and iterate through the original collection once.

If the current record is smaller than the last one in the top 10 list, or if you do not have 10 records yet, you add the item to the list. (And of course, remove the last item from the top 10 list when appropriate.)
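The loop described above can be sketched as follows (an untested illustration; `source` stands for the original collection and `int` elements are assumed):

```csharp
// Keep a sorted top-10 list while scanning the source collection once.
const int topCount = 10;
List<int> top = new List<int>(topCount);
foreach (int item in source)
{
    if (top.Count < topCount || item < top[top.Count - 1])
    {
        int index = top.BinarySearch(item);
        if (index < 0) index = ~index; // complement = insertion point
        top.Insert(index, item);
        if (top.Count > topCount)
            top.RemoveAt(topCount);    // drop the element that fell out
    }
}
```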

treaschf
+4  A: 

Aggregate is a good place to start with:

SortedList<TKey, TSource> resultlist = new SortedList<TKey, TSource>();
MyBigList.Aggregate(resultlist, (aktlist,entry) => {
   aktlist.Add(entry.Key, entry);
   if (aktlist.Count > 10) aktlist.RemoveAt(10);
   return aktlist;
});

If you want a different comparer, you can specify one in the constructor of the SortedList.

EDIT As mentioned by nikie, a SortedList cannot contain duplicate keys. You can use a standard List together with BinarySearch to achieve the same effect:

List<TSource> resultlist = new List<TSource>();
MyBigList.Aggregate(resultlist, (aktlist, entry) => {
   int index = aktlist.BinarySearch(entry);
   if (index < 0) index = ~index;
   if (index < 10) aktlist.Insert(index, entry);
   if (aktlist.Count > 10) aktlist.RemoveAt(10);
   return aktlist;
});

Again a custom comparer (together with a custom key selection) can be used as parameter to BinarySearch.
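For instance, a descending top 10 could be obtained by passing a custom comparer to the BinarySearch overload (sketch only; the comparer class name is made up for the example):

```csharp
// A custom comparer passed to BinarySearch yields a descending top 10.
class DescendingComparer : IComparer<int>
{
    public int Compare(int x, int y) { return y.CompareTo(x); }
}

List<int> aktlist = new List<int>();
IComparer<int> cmp = new DescendingComparer();
foreach (int entry in MyBigList) // MyBigList as in the snippet above
{
    int index = aktlist.BinarySearch(entry, cmp);
    if (index < 0) index = ~index;
    if (index < 10) aktlist.Insert(index, entry);
    if (aktlist.Count > 10) aktlist.RemoveAt(10);
}
```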

MartinStettner
IIRC SortedList throws an exception when a key already exists.
nikie
Very nice! It should be RemoveAt(10) though and like nikie said it does not accept duplicate keys.
DRBlaise
Thanks for your hints, I've edited the answer to reflect both of them ...
MartinStettner
Wow, I didn't know that BinarySearch gives you the bitwise complement of the larger element. I am giving you the answer!
DRBlaise
Actually you can save a lot of time, if you add a condition for the insertion (index < 10). I changed this in the post.
MartinStettner
+1  A: 

You could also implement a divide-and-conquer sorting algorithm like quicksort and break as soon as you have the first k sorted elements. But tvanfosson's suggestion is probably faster if k << N
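A sketch of such a partial quicksort (untested illustration; it recurses into the right partition only while the first k positions are not yet settled):

```csharp
// Partial quicksort: sorts only enough of the array that the first k
// elements end up in their final sorted positions.
static void PartialSort<T>(T[] a, int left, int right, int k, IComparer<T> cmp)
{
    if (left >= right) return;

    // Lomuto partition around the last element.
    T pivot = a[right];
    int store = left;
    for (int i = left; i < right; i++)
    {
        if (cmp.Compare(a[i], pivot) < 0)
        {
            T tmp = a[i]; a[i] = a[store]; a[store] = tmp;
            store++;
        }
    }
    T t = a[store]; a[store] = a[right]; a[right] = t;

    PartialSort(a, left, store - 1, k, cmp);
    if (store < k - 1) // right side matters only if the first k aren't settled
        PartialSort(a, store + 1, right, k, cmp);
}
```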

nikie