views:

124

answers:

4

There was a question asked about how to sort a List. There were several methods given from the basic List.Sort() to List.OrderBy(). The most laughable was a roll-your-own-SelectionSort. I promptly voted that down, but it made me think; wouldn't Linq's OrderBy(), applied to a list, do the same thing? myList.OrderBy(x=>x.Property).ToList() would produce an iterator that basically finds the minimum value of the projection in what's left of the collection and yield returns it. When going through the entire list, that's a selection sort.

Which made me think; what algorithms do the built-in sorters for Lists, SortedLists, Enumerables, etc. use, and by extension, should any of them be avoided for large collections? A SortedList, as it stays sorted by key, would probably use a single-pass InsertionSort on each add; find the first index with a value greater than the new one, and insert before it. Lists and Arrays probably MergeSort themselves pretty efficiently, but I don't know the actual algorithm behind Sort(). We've discussed OrderBy.

What I know above would seem to indicate that List.Sort() or Array.Sort() are the best options for a list of known size, and using Linq to sort an in-memory list or array should be discouraged. For a stream, there really isn't any other way then to OrderBy() the enumerable; the performance loss is mitigated by the fact that you can keep the data as a stream instead of having to have it all before sorting it.

EDIT:

The general consensus is that Sort() is faster given a concrete implementation of a List or Array. OrderBy is reasonable but slower because it adds O(N) complexity of extracting an array from the passed enumerable. SortedList initialization ends up being O(N^2) because of what's under the hood. Moral of the story, use List.Sort() instead of List.OrderBy() when you have an actual List.

+3  A: 

A quick gander through reflector tells me that List Sort methods utilize quicksort http://en.wikipedia.org/wiki/Quicksort through System.Collections.Generic.GenericArraySortHelper

SortedList uses Array.BinarySearch to figure out where to insert stuff on each Add

Enumerators don't have sorting logic

Quicksort is a good sorting choice for most situations though it can approach O(n^2) if you're really unlucky with the input data.

If you suspect your input data to be a huge pile of data in an unlucky (already sorted) order for quicksort a trick is to randomize the data first (which is always cheap) and then do the sorting on the randomized data. There are a few tricks the quicksort algorithm can implement to mitigate the problem of sorting already sorted (or nearly sorted) input data, I don't know whether the BCL implementation does any of these.

AndreasKnudsen
+4  A: 

One way to find out the performance of each method is to measure it:

List<int> createUnsortedList()
{
    List<int> list = new List<int>();
    for (int i = 0; i < 1000000; ++i)
        list.Add(random.Next());
    return list;
}

void Method1()
{
    List<int> list = createUnsortedList();
    list.Sort();
}

void Method2()
{
    List<int> list = createUnsortedList();
    list.OrderBy(x => x).ToList();
}

Result:

  • Method1: 0.67 seconds (List.Sort)
  • Method2: 3.10 seconds (OrderBy)

This shows that the performance of OrderBy is reasonable even for very large lists, but it's not quite as fast as using the built-in Sort method on a list. This is probably because the code for OrderBy is slightly more flexible - it takes a key selector which must be evaluated for each element.

Mark Byers
+3  A: 

Yes, your assumptions sound right. I did a little test to confirm it.

On 5000000 integers,

data.Sort();                           //  500 ms
data = data.OrderBy(a => a).ToList();  // 5000 ms
Henk Holterman
This may demonstrate that OrderBy is not good to use on large collections, but possibly not for the reason I stated. Apparently using OrderBy requires knowledge of the entire enumerable, which destroys the streaming quality of unordered Linq iterators.
KeithS
+7  A: 

Enumerable.OrderBy() slurps the IEnumerable<> into an array and uses quick sort. O(n) storage requirements. It's done by an internal class in System.Core.dll, EnumerableSort<TElement>.QuickSort(). The storage cost makes it uncompetitive with simply sorting the list, if you have one, since List<> sorts in-place. Linq often optimizes by checking the true capabilities of the IEnumerable with the is operator. Won't work here since List<>.Sort is destructive.

List<>.Sort and Array.Sort use in-place quick sort.

SortedList<> has O(n) complexity for an insertion, dominating the O(log(n)) complexity of finding the insertion point. So putting N unsorted items into it will cost O(n^2). SortedDictionary<> uses a red-black tree, giving insert O(log(n)) complexity. Thus O(nlog(n)) to fill it, same as amortized quick sort.

Hans Passant
how come SortedList<> has O(n) for inserting? I would think that the BinarySearch made it O(log(N) )
AndreasKnudsen
@Andreas - it has to make room for the element to insert. Which requires moving O(n) elements. Its an array under the hood.
Hans Passant
Hmm. Now I'm wondering, what if SortedList used a two-way linked list implementation with a "center" reference? Approaching O(N) to index a single element (you can start at either end or the center and work toward the actual "index"), but also O(N) to iterate ("next" is cheap), and insertion, given the O(logN) binary search (you can start from the center), would be constant (reassign two pointers) for a total insertion complexity of O(logN). That would make a sorted two-way linked list O(NlogN) complexity to fill with N unsorted elements.
KeithS
@Keith: the big Oh has little regard with dividing the algorithm in two. The smaller Oh you'll get from a linked list is completely defeated by the way CPU caches work on modern machines. Which is heavily optimized to load contiguous bytes of memory from RAM. A linked list has very poor cache locality, stalling the CPU for hundreds of cycles on a cache-miss. Which is why List<> is actually an array under the hood, not a linked list from traditional data structure text books.
Hans Passant
KeithS: If you wanted O(lg n) operations on a `SortedList`, you would just use a `SortedDictionary`, since a `SortedList` is actually a list of `KeyValuePair` elements.
Gabe