Answering to another StackOverflow question (this one) I stumbled upon an interresting sub-problem. What is the fastest way to sort an array of 6 ints ?
As the question is very low level (will be executed by a GPU):
- we can't assume libraries are available (and the call itself has it's cost), only plain C
- to avoid emptying instruction pipeline (that has a very high cost) we should probably minimize branches, jumps, and every other kind of control flow breaking (like those hidden behind sequence points in && or ||).
- room is constrained and minimizing registers and memory use is an issue, ideally in place sort is probably best.
Really this question is a kind of Golf where the goal is not to minimize source length but execution speed. I call it 'Zening` code as used in the title of the book Zen of Code optimization by Michael Abrash and it's sequels.
Here is my reference (naive, not optimized) implementation and my test set.
#include <stdio.h>
static __inline__ int sort6(int * d){
char j, i, imin;
int tmp;
for (j = 0 ; j < 5 ; j++){
imin = j;
for (i = j + 1; i < 6 ; i++){
if (d[i] < d[imin]){
imin = i;
}
}
tmp = d[j];
d[j] = d[imin];
d[imin] = tmp;
}
}
static __inline__ unsigned long long rdtsc(void)
{
unsigned long long int x;
__asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
return x;
}
int main(int argc, char ** argv){
int i;
int d[6][5] = {
{1, 2, 3, 4, 5, 6},
{6, 5, 4, 3, 2, 1},
{100, 2, 300, 4, 500, 6},
{100, 2, 3, 4, 500, 6},
{1, 200, 3, 4, 5, 600},
{1, 1, 2, 1, 2, 1}
};
unsigned long long cycles = rdtsc();
for (i = 0; i < 6 ; i++){
sort6(d[i]);
/*
* printf("d%d : %d %d %d %d %d %d\n", i,
* d[i][0], d[i][6], d[i][7],
* d[i][8], d[i][9], d[i][10]);
*/
}
cycles = rdtsc() - cycles;
printf("Time is %d\n", (unsigned)cycles);
}
Raw results
As number of variants is becoming large, I gathered them all in a test suite that can be found here. You can compile and execute it in your own environment. I'm quite interested by behavior on different target architecture/compilers. (OK guys, put it in answers, I will +1 every contributor of a new resultset). I will wait a couple of days to be sure there is no other interesting proposal and will probably give the answer to Daniel Stutzbach (for golfing) as he is at the source of the fastest solution so far.
Linux 32 bits, gcc 4.4.1, Intel Core 2 Quad Q8300, -O2
- Direct call to qsort library function : 24712
- Naive implementation (insertion sort) : 2145
- Insertion Sort (Daniel Stutzbach) : 1425
- Insertion Sort Unrolled : 1103
- Sorting Networks (Daniel Stutzbach) : 1080
- Sorting Networks (Paul R) : 585
- Sorting Networks 12 with Fast Swap : 518
- Rank Order (Rex Kerr) : 930
- Sorting Networks 12 reordered Swap : 480
Linux 64 bits, gcc 4.4.3 64 bits, Intel Core 2 Duo E8400, -O1
- Direct call to qsort library function : 16461
- Naive implementation (insertion sort) : 1008
- Insertion Sort (Daniel Stutzbach) : 882
- Insertion Sort Unrolled : 1233
- Sorting Networks (Daniel Stutzbach) : 792
- Sorting Networks (Paul R) : 459
- Sorting Networks 12 with Fast Swap : 432
- Rank Order (Rex Kerr) : 405
- Sorting Networks 12 reordered Swap : 351
Linux 64 bits, gcc 4.4.3 64 bits, Intel Core 2 Duo E8400, -O2
I included both -OA and -O2 results because surprisingly for several programs O2 is less efficient than O1. I wonder what specific optimization has this effect ?
- Direct call to qsort library function : 16353
- Naive implementation (insertion sort) : 1575
- Insertion Sort (Daniel Stutzbach) : 1134
- Insertion Sort Unrolled : 873
- Sorting Networks (Daniel Stutzbach) : 828
- Sorting Networks (Paul R) : 441
- Sorting Networks 12 with Fast Swap : 405
- Rank Order (Rex Kerr) : 630
- Sorting Networks 12 reordered Swap : 369
Comments on proposed solutions
Insertion Sort (Daniel Stutzbach)
As expected minimizing branches is indeed a good idea.
Sorting Networks (Daniel Stutzbach)
Better than insertion sort. I wondered if the main effect was not get from avoiding the external loop. I gave it a try by unrolled insertion sort to check and indeed we get roughly the same figures (code is here).
Sorting Networks (Paul R)
The best so far. The actual code I used to test is here. Don't know yet why it is nearly two times as fast as the other sorting network implementation. Parameter passing ? Fast max ?
Sorting Networks 12 SWAP with Fast Swap
As suggested by Daniel Stutzbach, I combined his 12 swap sorting network with branchless fast swap (code is here). It is indeed faster, the best so far with a small margin (roughly 5%) as could be expected using 1 less swap.
Calling Library qsort
To give another reference point I also tried as suggested to just call library qsort (code is here). As expected it is much slower : 10 to 30 times slower... I believe the main problem is that the linker defeats compiler optimizations : it can't inline this call.
I wrote 10 to 30 times slower because I also tried the call to qsort() source code on another machine using 64 bits Linux (my reference station is a 32 bits OS) but otherwise identical. The library call went down to about 8000 cycles instead of 25000. On this machine the other sorts implementations stay roughly at the same figures.
Rank order
Rex Kerr proposed another completely different method : for each item of the array compute compute directly it's final position. This is efficient because computing rank order do not need branch. The drawback of this method is that it takes three times the amount of memory of the array (one copy of array and variables to store rank orders). The performance results are very surprising (and interesting). On my reference architecture with 32 bits OS and Intel Core2 Quad E8300, cycle count was slightlty below 1000 (like sorting networks with branching swap). But when compiled and executed on my 64 bits box (Intel Core2 Duo) it performed much better : it became the fastest so far. I finally found out the true reason. My 32bits box use gcc 4.4.1 and my 64bits box gcc 4.4.3 and the last one seems much better at optimising this particular code (there was very little difference for other proposals).
Sorting Networks 12 with reordered Swap
The amazing efficiency of the Rex Kerr proposal with gcc 4.4.3 made me wonder : how could a program with 3 times as much memory use could be faster than branchless sorting networks ? My hypothesis was that it had less dependencies of the kind read after write, allowing for better use of the superscalar instruction scheduler of the x86. That gave me an idea: reorder swaps to minimize read after write dependencies. More simply put: when you do SWAP(1, 2); SWAP(0, 2);
you have to wait for the first swap to be finished before performing the second one because both access to a common memory cell. When you do SWAP(1, 2); SWAP(4, 5);
the processor can execute both in parallel. I tried it and if works as expected, the sorting networks is running about 10% faster that way and is the fastest so far.
THe code is as follow:
static inline void sort6_sorting_network_v4(int * d){
#define min(x, y) (y ^ ((x ^ y) & -(x < y)))
#define max(x, y) (x ^ ((x ^ y) & -(x < y)))
#define SWAP(x,y) { int tmp = min(d[x], d[y]); d[y] = max(d[x], d[y]); d[x] = tmp; }
SWAP(1, 2);
SWAP(4, 5);
SWAP(0, 2);
SWAP(3, 5);
SWAP(0, 1);
SWAP(3, 4);
SWAP(1, 4);
SWAP(0, 3);
SWAP(2, 5);
SWAP(1, 3);
SWAP(2, 4);
SWAP(2, 3);
#undef SWAP
#undef min
#undef max
}
If we believe our test set, the average number of cycles of the resulting code for one sort is around 70 cycles (6 tests are executed). That put each swap at an average of 5 cycles. I call that amazingly fast. But yet it may be possible to enhance it. The conditional SWAP used above is basically a sort2. And we can easily rewrite the above sequence of swaps as 3 sort3 and 3 swaps. Still I haven't found any implementation of sort3(a, b, c)
faster than SWAP(b, c); SWAP(a, c); SWAP(a, b);
using the above implementation. But if we can find one faster than 15 cycles, it should give a better solution.