views: 245

answers: 8

Inspired by this question

The choice of which algorithm to use to sort a collection can be made better if we know ahead of time how well sorted the collection already is. Is there a way we can measure (or maintain a measurement of) how well sorted a collection is? Can we do this in such a way that the cost of maintaining or measuring sortedness doesn't outweigh the benefit of choosing the best sort algorithm?

+2  A: 

You could use sampling: Check N elements spaced evenly in the list and see how many are in order. (Of course that only works in a random-access list, but usually that's the type you sort.)
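
A minimal sketch of that sampling idea in Java (the method name and the sample count are my own, illustrative choices):

    // Estimate sortedness by comparing evenly spaced pairs of elements.
    static double sampledSortedness(int[] a, int samples) {
        if (a.length < 2 || samples < 1) return 1.0;
        int step = Math.max(1, a.length / samples);
        int inOrder = 0, pairs = 0;
        for (int i = 0; i + step < a.length; i += step) {
            if (a[i] <= a[i + step]) inOrder++;
            pairs++;
        }
        return pairs == 0 ? 1.0 : (double) inOrder / pairs;
    }

A result near 1.0 suggests the list is mostly sorted, near 0.5 suggests random order, and near 0.0 suggests it's roughly reversed.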

Also have a threshold for small N. If N is small (e.g. 10) insertion sort is good even if the list isn't sorted. Java makes this optimization for small N in what is otherwise a merge-sort.

Jason Cohen
+1  A: 

One proposed solution:

Maintain the number of operations (insertions/deletions) performed since the last sort. The higher this number, the more unsorted the collection probably is.
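
A sketch of that bookkeeping as a hypothetical wrapper (the class and method names are mine):

    // Wraps a list and counts mutations since the last sort; a high
    // count suggests (but doesn't prove) the list has become unsorted.
    class TrackedList {
        private final java.util.ArrayList<Integer> items = new java.util.ArrayList<>();
        private int opsSinceSort = 0;

        void add(int value)      { items.add(value); opsSinceSort++; }
        void removeAt(int index) { items.remove(index); opsSinceSort++; }

        void sort() {
            java.util.Collections.sort(items);
            opsSinceSort = 0; // known-sorted again
        }

        int opsSinceLastSort() { return opsSinceSort; }
    }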

Doug T.
+1  A: 

If you don't know anything a priori about the collection, any time spent trying to instrument its sortedness will be far greater than the savings you would get by picking the optimal sorting algorithm.

If, on the other hand, you're going to be sorting many data sets that all have similar amounts of sortedness, you can measure the first dataset, pick an algorithm, and then use that one for all subsequent data sets.

Adam Rosenfield
+2  A: 

Augmenting @Doug:

A deletion can never make the list less sorted, so you don't have to track those.

When an insertion happens, compare the new element with its neighbors to determine whether the insertion was in order. If it was, don't touch the counter; if not, increase the "not sorted" counter.

Perhaps that's too much of a penalty (i.e. two compares per insert). You could do only one compare for a fuzzier result? Then again, I do like the idea of just counting inserts.
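
A sketch of that insert-time check (the class is hypothetical; it's just the two neighbor compares described above):

    // Counts insertions that land out of order relative to their neighbors.
    class InsertTracker {
        int notSorted = 0;

        void insertAt(java.util.List<Integer> list, int index, int value) {
            boolean inOrder =
                (index == 0 || list.get(index - 1) <= value)           // compare 1
                && (index == list.size() || value <= list.get(index)); // compare 2
            list.add(index, value);
            if (!inOrder) notSorted++;
        }
    }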

Jason Cohen
Good point, I like this method. I guess you have to decide whether knowing the sortedness is worth the cost of the two compares.
Doug T.
+2  A: 

You can measure the frequency of the data - if there are a lot of big changes from item to item, then the data is high frequency, indicating a pretty random distribution.

If the changes are smaller, then the data is low frequency - indicating a non-random distribution.

You can also measure the general trend using a filter - is the average trend measurably downwards or upwards - if downwards you might consider flipping the whole array or using a sort good for 'reversed' data.
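
A sketch of both measurements (the names and the interpretation are mine):

    // "Frequency": mean absolute change between neighbors; large values
    // hint at randomness. "Trend": mean signed change; a clearly negative
    // trend hints the data is mostly descending.
    static double[] frequencyAndTrend(int[] a) {
        double absSum = 0, signedSum = 0;
        for (int i = 1; i < a.length; i++) {
            int d = a[i] - a[i - 1];
            absSum += Math.abs(d);
            signedSum += d;
        }
        int pairs = Math.max(1, a.length - 1);
        return new double[] { absSum / pairs, signedSum / pairs };
    }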

There are other measurements you can use that might give you insight - check out signal processing and see what you can glean.

Adam Davis
+2  A: 

There is the Introspective Sort which does exactly that, sort of...

http://ralphunden.net/content/tutorials/a-guide-to-introsort/
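
For reference, the core idea is a depth-limited quicksort that falls back to heapsort (with insertion sort for small ranges). A simplified sketch, not the tuned version from that article; all ranges are half-open [lo, hi):

    // Quicksort until recursion depth exceeds ~2*log2(n), then heapsort.
    static void introsort(int[] a) {
        int depth = 2 * (32 - Integer.numberOfLeadingZeros(Math.max(1, a.length)));
        sort(a, 0, a.length, depth);
    }

    static void sort(int[] a, int lo, int hi, int depth) {
        if (hi - lo <= 16) { insertionSort(a, lo, hi); return; }
        if (depth == 0)    { heapsort(a, lo, hi); return; }
        int p = partition(a, lo, hi);
        sort(a, lo, p, depth - 1);
        sort(a, p + 1, hi, depth - 1);
    }

    static int partition(int[] a, int lo, int hi) {
        int pivot = a[hi - 1], p = lo; // Lomuto scheme, last-element pivot
        for (int i = lo; i < hi - 1; i++)
            if (a[i] < pivot) swap(a, i, p++);
        swap(a, p, hi - 1);
        return p;
    }

    static void insertionSort(int[] a, int lo, int hi) {
        for (int i = lo + 1; i < hi; i++) {
            int v = a[i], j = i - 1;
            while (j >= lo && a[j] > v) { a[j + 1] = a[j]; j--; }
            a[j + 1] = v;
        }
    }

    static void heapsort(int[] a, int lo, int hi) {
        int n = hi - lo;
        for (int i = n / 2 - 1; i >= 0; i--) siftDown(a, lo, i, n);
        for (int end = n - 1; end > 0; end--) {
            swap(a, lo, lo + end);
            siftDown(a, lo, 0, end);
        }
    }

    static void siftDown(int[] a, int lo, int i, int n) {
        while (true) {
            int l = 2 * i + 1, r = l + 1, m = i;
            if (l < n && a[lo + l] > a[lo + m]) m = l;
            if (r < n && a[lo + r] > a[lo + m]) m = r;
            if (m == i) return;
            swap(a, lo + i, lo + m);
            i = m;
        }
    }

    static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }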

Vinko Vrsalovic
+1, mate.. Everyone should know about that algorithm.
Nils Pipenbrinck
I learned that from you :-) thanks!
Vinko Vrsalovic
*lol* Nice to hear that..
Nils Pipenbrinck
A: 

Well, first check if the collection is sorted by definition; that will always save you a bunch of time :) For the most part, don't bother extending a collection to test whether it is sorted during its insert/delete operations. If the collection needs to be sorted, use a collection that is sorted by definition.
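
For example, in Java (one of many such collections):

    import java.util.TreeSet;

    public class SortedByDefinition {
        public static void main(String[] args) {
            // TreeSet keeps its elements ordered on every insert
            // (O(log n) each), so there is never a separate sort step.
            TreeSet<Integer> set = new TreeSet<>();
            set.add(42);
            set.add(7);
            set.add(19);
            System.out.println(set); // prints [7, 19, 42]
        }
    }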

If you are trying to extend a collections class to track for sorting, just keep a separate sorted list of pointers to elements in the collection...

Finally, for 99.99% of the time, why bother? Just use a quicksort. If your dataset is small enough that quicksort's constant-factor overhead would outweigh the savings over a bubble sort, the sort will be so fast you shouldn't even waste time asking the question.

Are you really telling me your question falls in the 0.01% of sorting where this needs to be addressed?

Zak
I agree. Honestly, this is mostly an interesting academic exercise in "premature optimization" ;).
Doug T.
I think the question presumes that quicksort alone is not fast enough. Sure, this rarely occurs, but the question is "what would I do if it did occur", not "tell me to get back to work writing boring, easy code to cover boring, easy cases" ;-)
Steve Jessop
A: 

This is a great question. My approach to it would be to ask: given a list of items, what is the probability that two consecutive items chosen from the list are in sorted order? As the list becomes more sorted, this probability approaches 100%.

Calculating this probability is relatively simple:

    // Count adjacent pairs that are already in order.
    int sorted = 0;
    for (int i = 0; i < list_length - 1; i++) { // stop before the last element
        if (list[i + 1] >= list[i]) {
            sorted++;
        }
    }
    // Floating-point division, so sortedness lands in [0, 1].
    double sortedness = (double) sorted / (list_length - 1);

I hope this helps!

Aaron Couture
Sorry, I didn't realize that you wanted a method that would improve an algorithm's sort time. Obviously my method of measurement wouldn't work, since it is O(N).
Aaron Couture