In your particular case the fastest sort is probably the one described in this answer. It is optimized specifically for an array of 6 ints and uses sorting networks. It is 50 times faster (measured on x86) than the library qsort. Sorting networks are optimal for sorting fixed-length arrays. Since they are a fixed sequence of instructions, they can even be implemented easily in hardware.
Here is the C code for your example dataset:
#include <stdio.h>
static inline void sort6_sorting_network_v4(int * d){
/* Branchless min/max: -(x < y) is all ones when x < y, so the XOR mask
   selects the smaller (or larger) of the two values without a branch. */
#define min(x, y) (y ^ ((x ^ y) & -(x < y)))
#define max(x, y) (x ^ ((x ^ y) & -(x < y)))
/* Compare-and-swap of two array slots: one comparator of the network. */
#define SWAP(x,y) { int tmp = min(d[x], d[y]); d[y] = max(d[x], d[y]); d[x] = tmp; }
SWAP(1, 2);
SWAP(4, 5);
SWAP(0, 2);
SWAP(3, 5);
SWAP(0, 1);
SWAP(3, 4);
SWAP(1, 4);
SWAP(0, 3);
SWAP(2, 5);
SWAP(1, 3);
SWAP(2, 4);
SWAP(2, 3);
#undef SWAP
#undef min
#undef max
}
int main(){
int a[6] = {1,3,6,7,1,2};
sort6_sorting_network_v4(a);
printf("%d %d %d %d %d %d\n", a[0], a[1], a[2], a[3], a[4], a[5]);
return 0;
}
Generally speaking, there are many sorting algorithms optimized for specialized cases. General-purpose algorithms like heap sort or quick sort are optimized for in-place sorting of an array of items. They yield a complexity of O(n log n), n being the number of items to sort.
The library function qsort() is very well coded and efficient in terms of complexity, but it uses a call to a comparison function provided by the user, and this call has a fairly high cost.
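For comparison, here is a minimal sketch of sorting the same six-element array with qsort(); the overhead comes from the indirect call to the user-supplied comparison function on every comparison (the name compare_ints is just an illustration):
#include <stdio.h>
#include <stdlib.h>

/* Comparison callback: qsort invokes this through a function pointer
   for every comparison it makes, which is where the extra cost lies. */
static int compare_ints(const void *a, const void *b){
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(){
    int a[6] = {1, 3, 6, 7, 1, 2};
    qsort(a, 6, sizeof a[0], compare_ints);
    printf("%d %d %d %d %d %d\n", a[0], a[1], a[2], a[3], a[4], a[5]);
    return 0;
}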
For sorting very large amounts of data, algorithms also have to take care of swapping data to and from disk. This is the kind of sort implemented in databases, and if you have such needs your best bet is to put the data in a database and use its built-in sort.