Which sort algorithm works best on mostly sorted data?
Based on the highly scientific method of watching animated gifs I would say Insertion and Bubble sorts are good candidates.
Keep away from QuickSort - it's very inefficient for pre-sorted data. Insertion sort handles almost-sorted data well by moving as few values as possible.
Insertion sort has a best case of O(n) on sorted input, and it stays close to that on mostly sorted input (better than quicksort).
Insertion sort with the following behavior:
- For each element k in slots 1..n, first check whether el[k] >= el[k-1]. If so, go to the next element. (Obviously skip the first element.)
- If not, use binary search in elements 1..k-1 to determine the insertion location, then scoot the elements over. (You might do this only if k > T, where T is some threshold value; with small k this is overkill.)

This method makes the fewest comparisons.
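In Python, the behavior described above might be sketched like this (the function name and the threshold default are illustrative choices, not from any library):

```python
from bisect import bisect_right

def binary_insertion_sort(a, threshold=8):
    """Sort a list in place: check a[k] >= a[k-1] first, and only on a
    miss find the insertion point (binary search for large k, linear
    scan for small k), then shift elements over."""
    for k in range(1, len(a)):
        if a[k] >= a[k - 1]:          # already in place: one comparison
            continue
        x = a[k]
        if k > threshold:
            # binary-search the sorted prefix a[0:k] for the slot
            pos = bisect_right(a, x, 0, k)
        else:
            # linear scan is cheaper for small k
            pos = 0
            while a[pos] <= x:
                pos += 1
        # scoot the elements over and drop x into place
        a[pos + 1:k + 1] = a[pos:k]
        a[pos] = x
    return a
```

On already-sorted input this makes exactly n-1 comparisons, since the first check succeeds every time.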
Try heapsort. I believe it's the most consistent of the O(n lg n) sorts.
As everyone else said, be careful of naive Quicksort - it can have O(N^2) performance on sorted or nearly sorted data. Nevertheless, with an appropriate algorithm for pivot choice (either random or median-of-three - see Choosing a Pivot for Quicksort), Quicksort will still work sanely.
In general, the difficulty with choosing algorithms such as insert sort is in deciding when the data is sufficiently out of order that Quicksort really would be quicker.
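To illustrate the pivot fix, here is a sketch of quicksort with median-of-three pivot selection (names and structure are mine; a simple Lomuto partition is used for brevity):

```python
def median_of_three(a, lo, hi):
    """Return the index of the median of a[lo], a[mid], a[hi]."""
    mid = (lo + hi) // 2
    return sorted([lo, mid, hi], key=lambda i: a[i])[1]

def quicksort(a, lo=0, hi=None):
    """Quicksort with a median-of-three pivot, which avoids the O(N^2)
    degeneration on sorted input that a first-element pivot causes."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return a
    p = median_of_three(a, lo, hi)
    a[p], a[hi] = a[hi], a[p]        # move the pivot to the end
    pivot, i = a[hi], lo
    for j in range(lo, hi):          # Lomuto partition
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    quicksort(a, lo, i - 1)
    quicksort(a, i + 1, hi)
    return a
```

On sorted input the median-of-three pivot is the middle element, so the partitions stay balanced.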
Try introspective sort. http://en.wikipedia.org/wiki/Introsort
It's quicksort based, but it avoids the worst case behaviour that quicksort has for nearly sorted lists.
The trick is that this sorting algorithm detects the cases where quicksort is going into worst-case mode and switches to heapsort or merge sort. Nearly sorted partitions are detected with a non-naive partitioning method, and small partitions are handled using insertion sort.
You get the best of all major sorting algorithms for the cost of more code and complexity. And you can be sure you'll never run into worst-case behaviour, no matter what your data looks like.
If you're a C++ programmer check your std::sort algorithm. It may already use introspective sort internally.
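std::sort is not required to be introsort, but the idea can be sketched roughly in Python (this illustrates the technique only - it is not the actual libstdc++/libc++ code; the depth limit of 2*log2(n) and the small-partition cutoff of 16 are conventional choices, not mandated):

```python
import heapq
from math import log2

def introsort(a):
    """Introsort sketch: quicksort with a recursion-depth limit;
    heapsort when the limit is hit, insertion sort on small runs."""
    if a:
        _introsort(a, 0, len(a) - 1, 2 * int(log2(len(a))))
    return a

def _introsort(a, lo, hi, depth):
    while hi - lo > 16:
        if depth == 0:
            _heapsort(a, lo, hi)      # quicksort is degenerating: bail out
            return
        depth -= 1
        p = _partition(a, lo, hi)
        _introsort(a, p + 1, hi, depth)
        hi = p - 1                    # loop on the left side (tail call)
    _insertion_sort(a, lo, hi)        # small partitions

def _partition(a, lo, hi):
    mid = (lo + hi) // 2
    a[mid], a[hi] = a[hi], a[mid]     # middle element as pivot
    pivot, i = a[hi], lo
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    return i

def _heapsort(a, lo, hi):
    h = a[lo:hi + 1]
    heapq.heapify(h)
    a[lo:hi + 1] = [heapq.heappop(h) for _ in range(len(h))]

def _insertion_sort(a, lo, hi):
    for k in range(lo + 1, hi + 1):
        x, j = a[k], k - 1
        while j >= lo and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x
```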
I'm not going to pretend to have all the answers here, because I think getting at the actual answers may require coding up the algorithms and profiling them against representative data samples. But I've been thinking about this question all evening, and here's what's occurred to me so far, and some guesses about what works best where.
Let N be the number of items total, M be the number out-of-order.
Bubble sort will have to make something like 2*M+1 passes through all N items. If M is very small (0, 1, 2?), I think this will be very hard to beat.
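For reference, here is bubble sort with the usual early-exit optimization, returning the pass count so you can see how few passes nearly-sorted input needs (a sketch, not tuned code):

```python
def bubble_sort(a):
    """Bubble sort with early exit: stop after the first pass that
    performs no swaps, so nearly-sorted input finishes quickly."""
    passes = 0
    while True:
        passes += 1
        swapped = False
        for i in range(len(a) - 1):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
                swapped = True
        if not swapped:
            break
    return a, passes
```

Fully sorted input takes a single (verification) pass; one adjacent swap takes two.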
If M is small (say less than log N), insertion sort will have great average performance. However, unless there's a trick I'm not seeing, it will have very bad worst case performance. (Right? If the last item in the order comes first, then you have to insert every single item, as far as I can see, which will kill the performance.) I'm guessing there's a more reliable sorting algorithm out there for this case, but I don't know what it is.
If M is bigger (say, greater than or equal to log N), introspective sort is almost certainly best.
Exception to all of that: If you actually know ahead of time which elements are unsorted, then your best bet will be to pull those items out, sort them using introspective sort, and merge the two sorted lists together into one sorted list. If you could quickly figure out which items are out of order, this would be a good general solution as well -- but I haven't been able to figure out a simple way to do this.
Further thoughts (overnight): If M+1 < N/M, then you can scan the list looking for a run of N/M in a row which are sorted, and then expand that run in either direction to find the out-of-order items. That will take at most 2N comparisons. You can then sort the unsorted items and do a sorted merge on the two lists. Total comparisons should be less than something like 4N + M log2(M), which is going to beat any non-specialized sorting routine, I think. (Even further thought: this is trickier than I was thinking, but I still think it's reasonably possible.)
Another interpretation of the question is that there may be many of out-of-order items, but they are very close to where they should be in the list. (Imagine starting with a sorted list and swapping every other item with the one that comes after it.) In that case I think bubble sort performs very well -- I think the number of passes will be proportional to the furthest out of place an item is. Insertion sort will work poorly, because every out of order item will trigger an insertion. I suspect introspective sort or something like that will work well, too.
Splaysort is an obscure sorting method based on splay trees, a type of adaptive binary tree. Splaysort is good not only for partially sorted data, but also for partially reverse-sorted data, or indeed any data that has any kind of pre-existing order. It is O(n log n) in the general case, and O(n) in the case where the data is sorted in some way (forward, reverse, organ-pipe, etc.).
Its great advantage over insertion sort is that it doesn't revert to O(n^2) behaviour when the data isn't sorted at all, so you don't need to be absolutely sure that the data is partially sorted before using it.
Its disadvantage is the extra space overhead of the splay tree structure it needs, as well as the time required to build and destroy the splay tree. But depending on the size of data and amount of pre-sortedness that you expect, the overhead may be worth it for the increase in speed.
A paper on splaysort was published in Software--Practice & Experience.
Timsort is "an adaptive, stable, natural mergesort" with "supernatural performance on many kinds of partially ordered arrays (less than lg(N!) comparisons needed, and as few as N-1)". Python's built-in sort() has used this algorithm for some time, apparently with good results. It's specifically designed to detect and take advantage of partially sorted subsequences in the input, which often occur in real datasets. It is often the case in the real world that comparisons are much more expensive than swapping items in a list, since one typically just swaps pointers, which very often makes timsort an excellent choice. However, if you know that your comparisons are always very cheap (writing a toy program to sort 32-bit integers, for instance), other algorithms exist that are likely to perform better. The easiest way to take advantage of timsort is of course to use Python, but since Python is open source you might also be able to borrow the code. Alternately, the description above contains more than enough detail to write your own implementation.
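You can observe the "as few as N-1 comparisons" behavior directly by counting comparisons made by Python's built-in sort (the Counted wrapper below is just an instrumentation trick, not part of any API):

```python
from functools import total_ordering

@total_ordering
class Counted:
    """Wrapper that counts the comparisons list.sort() performs."""
    comparisons = 0
    def __init__(self, v):
        self.v = v
    def __lt__(self, other):
        Counted.comparisons += 1
        return self.v < other.v
    def __eq__(self, other):
        return self.v == other.v

n = 1000
data = [Counted(i) for i in range(n)]   # already sorted: one long run
Counted.comparisons = 0
data.sort()
print(Counted.comparisons)              # prints 999, i.e. n - 1
```

Timsort's run detection scans the whole array once, finds a single ascending run, and has nothing left to merge.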
If you need a specific implementation of sorting algorithms, data structures, or anything related to the above, may I recommend the excellent "Data Structures and Algorithms" project on CodePlex?
It has everything you need without reinventing the wheel.
Just my little grain of salt.
Bubble-sort (or, safer yet, bi-directional bubble sort) is likely ideal for mostly sorted lists, though I bet a tweaked comb-sort (with a much lower initial gap size) would be a little faster when the list wasn't quite as perfectly sorted. Comb sort degrades to bubble-sort.
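A comb sort sketch for comparison; the initial_gap parameter is the hypothetical "lower initial gap" tweak mentioned above (standard comb sort starts with the gap equal to the list length):

```python
def comb_sort(a, shrink=1.3, initial_gap=None):
    """Comb sort: bubble sort with a shrinking gap.  Once the gap
    reaches 1 it behaves exactly like bubble sort with early exit."""
    gap = initial_gap or len(a)
    swapped = True
    while gap > 1 or swapped:
        gap = max(1, int(gap / shrink))
        swapped = False
        for i in range(len(a) - gap):
            if a[i] > a[i + gap]:
                a[i], a[i + gap] = a[i + gap], a[i]
                swapped = True
    return a
```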
Insertion sort takes time O(n + the number of inversions).
An inversion is a pair (i, j) such that i < j && a[i] > a[j]. That is, an out-of-order pair.
One measure of being "almost sorted" is the number of inversions: one could take "almost sorted data" to mean data with few inversions. If the number of inversions is known to be linear (for instance, you have just appended O(1) elements to a sorted list), insertion sort takes O(n) time.
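Inversions can be counted in O(n log n) with a modified merge sort, which is handy for checking how "almost sorted" your data actually is (a sketch; the shift count of insertion sort equals this number):

```python
def count_inversions(a):
    """Count pairs (i, j) with i < j and a[i] > a[j] using a
    merge-sort pass that tallies cross-half inversions."""
    def sort(xs):
        if len(xs) <= 1:
            return xs, 0
        mid = len(xs) // 2
        left, li = sort(xs[:mid])
        right, ri = sort(xs[mid:])
        merged, inv = [], li + ri
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                # left[i:] are all greater than right[j]:
                # each of them forms an inversion with right[j]
                inv += len(left) - i
                merged.append(right[j]); j += 1
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged, inv
    return sort(list(a))[1]
```

For example, a sorted list has 0 inversions and a reversed list of length n has n*(n-1)/2.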