views:

4141

answers:

14

Which sort algorithm works best on mostly sorted data?

+37  A: 

Based on the highly scientific method of watching animated gifs I would say Insertion and Bubble sorts are good candidates.

Tom Ritter
that's an excellent link by the way, kudos and a +1
ninesided
That link should be in every sorting question. Thanks!
Adam Davis
Bubble sort is terrible. It is always O(n^2). At least take that out of your answer for it to be right please.
jjnguy
jjnguy, that is just plain wrong. I think you need to re-take your algorithms class. On nearly sorted data (its adaptive case) it is O(N). However, it takes 2 passes through the data and Insertion only takes 1 for nearly sorted data, which makes Insertion the winner. Bubble is still good though
Simucal
Bubble sort only needs something like two passes for each unsorted element. If the data truly only has a few unsorted elements, then bubble sort will be pretty snappy.
Sol
Performance degrades really badly if your data is ever not nearly sorted though. I would still not use it, personally.
Blorgbeard
It looks like Bubble > Shell > Merge > Heap. But Bubble and Shell appear to be front runners for mostly sorted data.
Thomas Owens
@Thomas Owens, Insertion > Bubble Or Shell for nearly sorted data.
Simucal
That link was broken when I tried it. Try this instead: http://www.sorting-algorithms.com/
Michael La Voie
+3  A: 

Insertion or shell sort!

ninesided
A: 

Keep away from QuickSort - it's very inefficient for pre-sorted data. Insertion sort handles almost-sorted data well by moving as few values as possible.

Werg38
-1 Every industrial implementation of Quicksort has a reasonable pivot selection.
Stephan Eggermont
A: 

Insertion sort is best case O(n) on sorted input, and it stays close to that on mostly sorted input (better than quicksort).

jjnguy
+8  A: 

Insertion sort with the following behavior:

  1. For each element k in slots 1..n, first check whether el[k] >= el[k-1]. If so, go to next element. (Obviously skip the first element.)
  2. If not, use binary-search in elements 1..k-1 to determine the insertion location, then scoot the elements over. (You might do this only if k>T where T is some threshold value; with small k this is overkill.)

This method makes the fewest comparisons.
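A rough Python sketch of the two steps above (the function name and the threshold value are illustrative choices, not prescribed):

```python
import bisect

def adaptive_insertion_sort(a, threshold=8):
    """Insertion sort with a cheap in-order pre-check and binary-search insertion."""
    for k in range(1, len(a)):
        # Step 1: an already-in-order element costs exactly one comparison.
        if a[k] >= a[k - 1]:
            continue
        x = a[k]
        if k > threshold:
            # Step 2: binary-search the insertion point in the sorted prefix,
            # then scoot the elements over by one slot.
            pos = bisect.bisect_right(a, x, 0, k)
        else:
            # For small k, a plain linear scan is cheaper than binary search.
            pos = k
            while pos > 0 and a[pos - 1] > x:
                pos -= 1
        a[pos + 1:k + 1] = a[pos:k]  # shift the tail right by one
        a[pos] = x
    return a
```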

Jason Cohen
I think bubble sort might beat this if the number of unsorted elements is very small (like, one or two), but in general this strikes me as probably the best solution.
Sol
Because of step 1, for any elements that are already sorted there is exactly one compare and zero data-moves, which is obviously the best you can do. Step 2 is the one you could improve on, but bubble will move the same number of elements and *might* have more compares, depending on your impl.
Jason Cohen
Actually, on further thought I think bubble sort is stronger than I was thinking. It's actually a fairly tricky question. For instance, if you take the case where the list is entirely sorted except the element which should be last is first, bubble sort will vastly outperform what you describe.
Sol
A: 

*ponder* Try heapsort. I believe it's the most consistent of the O(n lg n) sorts.

Paul Nathan
+2  A: 

As everyone else said, be careful of naive Quicksort - that can have O(N^2) performance on sorted or nearly sorted data. Nevertheless, with an appropriate algorithm for choice of pivot (either random or median-of-three - see Choosing a Pivot for Quicksort), Quicksort will still work sanely.
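A minimal sketch of median-of-three pivot selection in Python (illustrative, not the code from the linked article):

```python
def quicksort_mo3(a, lo=0, hi=None):
    """Quicksort with median-of-three pivot; stays O(n log n) on sorted input."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return a
    mid = (lo + hi) // 2
    # Order a[lo], a[mid], a[hi]; the median of the three ends up at a[mid].
    if a[mid] < a[lo]: a[lo], a[mid] = a[mid], a[lo]
    if a[hi] < a[lo]:  a[lo], a[hi] = a[hi], a[lo]
    if a[hi] < a[mid]: a[mid], a[hi] = a[hi], a[mid]
    pivot = a[mid]
    # Hoare-style partition around the pivot value.
    i, j = lo, hi
    while i <= j:
        while a[i] < pivot: i += 1
        while a[j] > pivot: j -= 1
        if i <= j:
            a[i], a[j] = a[j], a[i]
            i += 1
            j -= 1
    quicksort_mo3(a, lo, j)
    quicksort_mo3(a, i, hi)
    return a
```

On sorted or reverse-sorted input the middle element is the true median, so the splits stay balanced instead of degrading to O(N^2).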

In general, the difficulty with choosing algorithms such as insertion sort is deciding when the data is sufficiently out of order that Quicksort really would be quicker.

Jonathan Leffler
+3  A: 

Try introspective sort. http://en.wikipedia.org/wiki/Introsort

It's quicksort based, but it avoids the worst case behaviour that quicksort has for nearly sorted lists.

The trick is that this sorting algorithm detects the cases where quicksort goes into worst-case mode and switches to heap sort or merge sort. Nearly sorted partitions are detected by a non-naive partition method, and small partitions are handled using insertion sort.

You get the best of all major sorting algorithms for the cost of more code and complexity. And you can be sure you'll never run into worst-case behaviour no matter what your data looks like.

If you're a C++ programmer check your std::sort algorithm. It may already use introspective sort internally.
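For a rough picture of the mechanism, here is a toy introsort in Python (the depth limit, the cutoff of 16, and the helper names are my own choices, not std::sort's):

```python
import heapq
import math

def introsort(a):
    """Quicksort that falls back to heapsort past a depth limit,
    and to insertion sort for tiny partitions."""
    def insertion(lo, hi):
        for k in range(lo + 1, hi + 1):
            x = a[k]; j = k
            while j > lo and a[j - 1] > x:
                a[j] = a[j - 1]; j -= 1
            a[j] = x

    def heapsort(lo, hi):
        part = a[lo:hi + 1]
        heapq.heapify(part)
        a[lo:hi + 1] = [heapq.heappop(part) for _ in range(len(part))]

    def sort(lo, hi, depth):
        while hi - lo > 16:
            if depth == 0:
                heapsort(lo, hi)   # recursion got too deep: bail out
                return
            depth -= 1
            # Median-of-three pivot selection.
            mid = (lo + hi) // 2
            if a[mid] < a[lo]: a[lo], a[mid] = a[mid], a[lo]
            if a[hi] < a[lo]:  a[lo], a[hi] = a[hi], a[lo]
            if a[hi] < a[mid]: a[mid], a[hi] = a[hi], a[mid]
            pivot = a[mid]
            i, j = lo, hi
            while i <= j:
                while a[i] < pivot: i += 1
                while a[j] > pivot: j -= 1
                if i <= j:
                    a[i], a[j] = a[j], a[i]; i += 1; j -= 1
            sort(i, hi, depth)     # recurse right, loop on the left half
            hi = j
        insertion(lo, hi)          # small partition: insertion sort

    if a:
        sort(0, len(a) - 1, 2 * max(1, int(math.log2(len(a)))))
    return a
```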

Nils Pipenbrinck
+2  A: 

I'm not going to pretend to have all the answers here, because I think getting at the actual answers may require coding up the algorithms and profiling them against representative data samples. But I've been thinking about this question all evening, and here's what's occurred to me so far, and some guesses about what works best where.

Let N be the number of items total, M be the number out-of-order.

Bubble sort will have to make something like 2*M+1 passes through all N items. If M is very small (0, 1, 2?), I think this will be very hard to beat.

If M is small (say less than log N), insertion sort will have great average performance. However, unless there's a trick I'm not seeing, it will have very bad worst case performance. (Right? If the last item in the order comes first, then you have to insert every single item, as far as I can see, which will kill the performance.) I'm guessing there's a more reliable sorting algorithm out there for this case, but I don't know what it is.

If M is bigger (say equal to or greater than log N), introspective sort is almost certainly best.

Exception to all of that: If you actually know ahead of time which elements are unsorted, then your best bet will be to pull those items out, sort them using introspective sort, and merge the two sorted lists together into one sorted list. If you could quickly figure out which items are out of order, this would be a good general solution as well -- but I haven't been able to figure out a simple way to do this.

Further thoughts (overnight): If M+1 < N/M, then you can scan the list looking for a run of N/M in a row which are sorted, and then expand that run in either direction to find the out-of-order items. That will take at most 2N comparisons. You can then sort the unsorted items, and do a sorted merge on the two lists. Total comparisons should be less than something like 4N + M log2(M), which is going to beat any non-specialized sorting routine, I think. (Even further thought: this is trickier than I was thinking, but I still think it's reasonably possible.)

Another interpretation of the question is that there may be many out-of-order items, but they are very close to where they should be in the list. (Imagine starting with a sorted list and swapping every other item with the one that comes after it.) In that case I think bubble sort performs very well: the number of passes will be proportional to the furthest out of place an item is. Insertion sort will work poorly, because every out-of-order item will trigger an insertion. I suspect introspective sort or something like that will work well, too.
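To make the pass counts above concrete, here is an early-exit bubble sort that reports how many passes it made (a sketch; the names are mine):

```python
def bubble_sort_passes(a):
    """Sort a in place; return the number of passes, which tracks how far
    the most-displaced item had to travel leftward (plus one verify pass)."""
    passes = 0
    swapped = True
    while swapped:
        swapped = False
        passes += 1
        for i in range(len(a) - 1):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
                swapped = True
    return passes
```

An already-sorted list costs one pass; the "swap every other item" case costs two; an item that belongs at the front but starts d slots away costs d + 1.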

Sol
+1  A: 

Splaysort is an obscure sorting method based on splay trees, a type of adaptive binary tree. Splaysort is good not only for partially sorted data, but also partially reverse-sorted data, or indeed any data that has any kind of pre-existing order. It is O(n log n) in the general case, and O(n) in the case where the data is sorted in some way (forward, reverse, organ-pipe, etc.).

Its great advantage over insertion sort is that it doesn't revert to O(n^2) behaviour when the data isn't sorted at all, so you don't need to be absolutely sure that the data is partially sorted before using it.

Its disadvantage is the extra space overhead of the splay tree structure it needs, as well as the time required to build and destroy the splay tree. But depending on the size of data and amount of pre-sortedness that you expect, the overhead may be worth it for the increase in speed.

A paper on splaysort was published in Software: Practice and Experience.

TimB
+6  A: 

timsort

Timsort is "an adaptive, stable, natural mergesort" with "supernatural performance on many kinds of partially ordered arrays (less than lg(N!) comparisons needed, and as few as N-1)". Python's built-in sort() has used this algorithm for some time, apparently with good results. It's specifically designed to detect and take advantage of partially sorted subsequences in the input, which often occur in real datasets.

It is often the case in the real world that comparisons are much more expensive than swapping items in a list, since one typically just swaps pointers, which very often makes timsort an excellent choice. However, if you know that your comparisons are always very cheap (writing a toy program to sort 32-bit integers, for instance), other algorithms exist that are likely to perform better.

The easiest way to take advantage of timsort is of course to use Python, but since Python is open source you might also be able to borrow the code. Alternately, the description above contains more than enough detail to write your own implementation.

zaphod
log(n!) is O(n log n), therefore it is not "supernatural".
J.F. Sebastian
Here's the Java implementation coming in JDK7: http://cr.openjdk.java.net/~martin/webrevs/openjdk7/timsort/raw_files/new/src/share/classes/java/util/TimSort.java
Tim
log(n!) is not fast. http://www.wolframalpha.com/input/?i=plot[log(N!),{N,0,1000}]
Behrooz
@J.F. Sebastian: timsort is much faster than `lg(n!)` comparisons on an almost-sorted array, all the way down to `O(n)`! | @behrooz: No comparison sort can have an average case of better than `O(n log n)`, and `lg(n!)` is `O(n log n)`. So timsort's worst case is asymptotically no worse than that of any other comparison sort. Furthermore its best case is better than or equal to any other comparison sort.
Artelius
A: 

If you need a specific implementation of sorting algorithms, data structures, or anything related to the above, may I recommend the excellent "Data Structures and Algorithms" project on CodePlex?

It will have everything you need without reinventing the wheel.

Just my little grain of salt.

Maxim
A: 

Bubble-sort (or, safer yet, bi-directional bubble sort) is likely ideal for mostly sorted lists, though I bet a tweaked comb-sort (with a much lower initial gap size) would be a little faster when the list wasn't quite as perfectly sorted. Comb sort degrades to bubble-sort.

Brian
+2  A: 

Insertion sort takes time O(n + the number of inversions).

An inversion is a pair (i, j) such that i < j && a[i] > a[j]. That is, an out-of-order pair.

One measure of being "almost sorted" is the number of inversions---one could take "almost sorted data" to mean data with few inversions. If one knows the number of inversions to be linear (for instance, you have just appended O(1) elements to a sorted list), insertion sort takes O(n) time.
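One way to see the bound: each one-position shift in insertion sort removes exactly one inversion, so the total number of shifts equals the inversion count (a quick illustrative check in Python; the helper names are mine):

```python
def insertion_sort_shifts(a):
    """Sort a in place; return the number of one-position shifts performed."""
    shifts = 0
    for k in range(1, len(a)):
        x = a[k]
        j = k
        while j > 0 and a[j - 1] > x:
            a[j] = a[j - 1]  # each shift fixes exactly one inversion
            j -= 1
            shifts += 1
        a[j] = x
    return shifts

def count_inversions(a):
    """Naive O(n^2) inversion count, per the definition above."""
    n = len(a)
    return sum(1 for i in range(n) for j in range(i + 1, n) if a[i] > a[j])
```

For any input, insertion_sort_shifts returns the same number as count_inversions, which is the "+ number of inversions" term in the running time.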

Jonas Kölker