views:

1102

answers:

7

I have an array of values which is almost, but not quite sorted, with a few values displaced (say, 50 in 100000). How to sort it most efficiently? (performance is absolutely crucial here and should be way faster than O(N)).

I know about smoothsort, but I can't find Java implementation. Does anyone know whether it is already implemented? Or what I can use for this task instead of smoothsort?

+6  A: 

As Botz3000 noted, you can't perform such an operation faster than O(N). The most basic element of any algorithm would be to find those entries in the array that are out of order. This requires O(N), even before you figure out what to do with them.

If indeed the number of "out-of-order" elements is orders of magnitude below the total number of elements, you could use the following algorithm (assuming linked list):

  1. Find all out-of-order items and extract the from the original list to a separate list, O(N)
  2. The result is two lists: a sorted list and a short extracted list
  3. For each of the extracted elements, insert them into the sorted list. That would be O(log(N)) for each, total is O(Xlog(N)), where X is the number of the extracted elements. If X is very small relative to N, you end up with a total of O(N).
Roee Adler
Actually is not that difficult to prove. You can prove it by calculating the limit of xlog(n) when x/n -> 0.
Martinho Fernandes
How to you derive the O(log N) in step 3? When using a linked list, binary search might be hard to do...
Dirk
@Dirk: even if you can'd do binary search, you can insert all elements in O(XN) time, which for small Xs, yields O(N).
Martinho Fernandes
@Martinho: thanks, good point, I removed the "difficult" statement.
Roee Adler
Step 1 in your algorithm is a major flaw. You can't easilly *define*, let alone *select* out-of-order elements. Especially for O(N).
Pavel Shved
@Martinho: Yes. For X sufficiently small when compared to N the algorithm is essentially O(N). I just stumbled over the "log N".
Dirk
@Pavel: I don't understand your comment. If you go over a sorted list, and test for all items whether a[i-1] <= a[i] <= a[i+1], you can easily find all out-of-order elements in O(N)
Roee Adler
@Dirk: despite my correction you still raise an interesting point: binary search in a linked list is hard at best, impossible at worse. If I know how many elements are in the list I can take advantage of that and in step 1 get some anchors for binary search later on.
Martinho Fernandes
@Rax: the invariant is even weaker. a[i-1] <= a[i] for all i > 1 is enough.
Martinho Fernandes
@Rax: yes, you can find them, but it may happen that the *amount of so-called "out-of-order" elements will also happen to be O(N)*. Even if the array had smaller number of them when you look at it. Consider [1,5,10,2,3,4,5,6,7,8,9,10,11]. Brute-force algorithm will call elements 2 to 10 out-of-order. That won't make any sense on the later steps.
Pavel Shved
@Pavel: thanks for pointing that.
Martinho Fernandes
As cleared in this question: http://stackoverflow.com/questions/1390960/how-do-i-select-all-elements-in-a-list-that-are-out-of-order, step one would take O(N log N) time, overshadowing the other parts of the algorithm, so it won't run in O(N) time.
Martinho Fernandes
+8  A: 

Actualy, Wikipedia contains Java implementation of smoothsort: http://en.wikipedia.org/wiki/Smoothsort.

Superfilin
+4  A: 

Cocktail Sort

If you want a simple algorithm that's easy to implement, you could do a cocktail sort. It would work reasonably well on nearly-sorted input.

DigitalRoss
+1 for teaching me about the name for cocktail-sort.
Henk Holterman
@Henk: same here. In high school I implemented it as an improvement over bubble sort, but I never knew there was a name for it.
Martinho Fernandes
A: 

You are right about the impossibility of reaching O(N), but assuming a multi-core machine (which I have), we can cheat a little by using a parallel sorting algorithm.

Alexander Temerev
Well, the algorithm still runs in O(N). O(N/2) is O(N).
Martinho Fernandes
@Martinho: unless you have a number of cores that grows with the size of the input :)
Roee Adler
@Rax: something that takes advantage of parallel universes, like quantum bogosort http://en.wikipedia.org/wiki/Bogosort#Quantum_Bogosort?
Martinho Fernandes
@Martinho: hilarious, wasn't aware of it
Roee Adler
By the way, if you use a parallel sort (especially those of the divide-and-conquer variety) don't forget to use a cutoff to switch to a sequential sort at some point, when the parallelism overhead outweights its benefits.
Martinho Fernandes
A: 

Implement what we called in school a Shell's sort. That's bubblesorting sub-arrays. A sub-array with step k is an array of elements with indicies 0, k, 2k, 3k...

If you choose k = 3i+1, and perform multiple bubble sorts, starting from higher i-s downto 0, the times will be smaller on nearly-sorted array.

Pavel Shved
+3  A: 

[Sun] JDK7 has (or will have) an implementation of Tim sort (from Python). It's a merge sort that takes advantage of order already existing in the array.

Tom Hawtin - tackline
Could you please either provide a link or explain yourself how tim sort works?
Martinho Fernandes
Ok, found it and added the link.
Martinho Fernandes
Tim sort was first implemented for python, and when some java guy (I think it was josh bloch) saw it, he said "we need to get that into java" http://svn.python.org/projects/python/trunk/Objects/listsort.txt
Peter Recore
or of course http://stackoverflow.com/search?q=timsort
Peter Recore
I said the same thing! Thanks for pointing out this pretty good, but less known algorithm.
Martinho Fernandes
+1  A: 

Smoothsort or Timsort are great algorithms, and would be sensible things to use.

I'd add that what you might not realise is that the humble insertion sort is adaptive. Indeed, for really almost sorted lists, as you seem to have, my understanding (which i can't back up with a reference) is that it's faster than the more sophisticated algorithms. The problem is that if the input isn't almost-sorted, it rapidly degrades to O(n^2). Still, it's very simple to implement correctly, so if you know for sure that your input is always almost-sorted, it would be a good choice.

Tom Anderson