While answering another StackOverflow question (this one) I stumbled upon an interesting sub-problem: what is the fastest way to sort an array of 6 ints?

As the question is very low level (it will be executed by a GPU):

  • we can't assume libraries are available (and a call itself has its cost), only plain C
  • to avoid emptying the instruction pipeline (which has a very high cost) we should probably minimize branches, jumps, and every other kind of control-flow break (like those hidden behind sequence points in && or ||); a small illustration follows this list
  • space is constrained, and minimizing register and memory use is an issue; ideally an in-place sort is probably best
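
To make that last point about hidden control flow concrete, here is a minimal illustration (hypothetical, not part of the test suite): short-circuit && hides a conditional jump behind its sequence point, while the bitwise form evaluates both comparisons unconditionally.

/* Typically compiled with a conditional jump: && may not evaluate
 * its right operand when the left one is false. */
static inline int both_positive_branchy(int x, int y){
    return x > 0 && y > 0;
}

/* Branchless alternative: each comparison yields 0 or 1, combined
 * with a bitwise AND; no control-flow break. */
static inline int both_positive_branchless(int x, int y){
    return (x > 0) & (y > 0);
}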

Really this question is a kind of golf where the goal is not to minimize source length but execution speed. I call it 'Zening' code, after the title of the book Zen of Code Optimization by Michael Abrash and its sequels.

Here is my reference (naive, not optimized) implementation and my test set.

#include <stdio.h>

/* Naive reference implementation: a selection sort. */
static __inline__ void sort6(int * d){
    char j, i, imin;
    int tmp;
    for (j = 0 ; j < 5 ; j++){
        imin = j;
        for (i = j + 1; i < 6 ; i++){
            if (d[i] < d[imin]){
                imin = i;
            }
        }
        tmp = d[j];
        d[j] = d[imin];
        d[imin] = tmp;
    }
}

static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    /* 0x0f 0x31 is the RDTSC instruction (read the time-stamp counter). */
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
    return x;
}

int main(int argc, char ** argv){
    int i;
    int d[6][6] = {
        {1, 2, 3, 4, 5, 6},
        {6, 5, 4, 3, 2, 1},
        {100, 2, 300, 4, 500, 6},
        {100, 2, 3, 4, 500, 6},
        {1, 200, 3, 4, 5, 600},
        {1, 1, 2, 1, 2, 1}
    };

    unsigned long long cycles = rdtsc();
    for (i = 0; i < 6 ; i++){
        sort6(d[i]);
        /*
         * printf("d%d : %d %d %d %d %d %d\n", i,
         *  d[i][0], d[i][1], d[i][2],
         *  d[i][3], d[i][4], d[i][5]);
        */
    }
    cycles = rdtsc() - cycles;
    printf("Time is %d\n", (unsigned)cycles);
}

Raw results

As the number of variants has become large, I gathered them all in a test suite that can be found here. You can compile and execute it in your own environment. I'm quite interested in the behavior on different target architectures/compilers. (OK guys, put it in answers, I will +1 every contributor of a new result set.) I will wait a couple of days to be sure there is no other interesting proposal, and will probably give the answer to Daniel Stutzbach (for golfing) as he is at the source of the fastest solution so far.

Linux 32 bits, gcc 4.4.1, Intel Core 2 Quad Q8300, -O2

  • Direct call to qsort library function : 24712
  • Naive implementation (selection sort) : 2145
  • Insertion Sort (Daniel Stutzbach) : 1425
  • Insertion Sort Unrolled : 1103
  • Sorting Networks (Daniel Stutzbach) : 1080
  • Sorting Networks (Paul R) : 585
  • Sorting Networks 12 with Fast Swap : 518
  • Rank Order (Rex Kerr) : 930
  • Sorting Networks 12 reordered Swap : 480

Linux 64 bits, gcc 4.4.3, Intel Core 2 Duo E8400, -O1

  • Direct call to qsort library function : 16461
  • Naive implementation (selection sort) : 1008
  • Insertion Sort (Daniel Stutzbach) : 882
  • Insertion Sort Unrolled : 1233
  • Sorting Networks (Daniel Stutzbach) : 792
  • Sorting Networks (Paul R) : 459
  • Sorting Networks 12 with Fast Swap : 432
  • Rank Order (Rex Kerr) : 405
  • Sorting Networks 12 reordered Swap : 351

Linux 64 bits, gcc 4.4.3, Intel Core 2 Duo E8400, -O2

I included both -O1 and -O2 results because, surprisingly, for several programs -O2 is less efficient than -O1. I wonder what specific optimization has this effect?

  • Direct call to qsort library function : 16353
  • Naive implementation (selection sort) : 1575
  • Insertion Sort (Daniel Stutzbach) : 1134
  • Insertion Sort Unrolled : 873
  • Sorting Networks (Daniel Stutzbach) : 828
  • Sorting Networks (Paul R) : 441
  • Sorting Networks 12 with Fast Swap : 405
  • Rank Order (Rex Kerr) : 630
  • Sorting Networks 12 reordered Swap : 369

Comments on proposed solutions

Insertion Sort (Daniel Stutzbach)

As expected, minimizing branches is indeed a good idea.

Sorting Networks (Daniel Stutzbach)

Better than insertion sort. I wondered whether the main effect was not simply due to avoiding the outer loop. I tried an unrolled insertion sort to check, and indeed we get roughly the same figures (code is here).

Sorting Networks (Paul R)

The best so far. The actual code I used to test is here. I don't know yet why it is nearly two times as fast as the other sorting network implementation. Parameter passing? Fast max?

Sorting Networks 12 with Fast Swap

As suggested by Daniel Stutzbach, I combined his 12-swap sorting network with a branchless fast swap (code is here). It is indeed faster, the best so far by a small margin (roughly 5%), as could be expected from using one less swap.

Calling Library qsort

To give another reference point, I also tried, as suggested, to just call the library qsort (code is here). As expected it is much slower: 10 to 30 times slower... I believe the main problem is that the linker defeats compiler optimizations: it can't inline this call.

I wrote 10 to 30 times slower because I also tried the qsort() call on another machine running 64-bit Linux (my reference station runs a 32-bit OS) but otherwise identical. The library call went down to about 8000 cycles instead of 25000. On this machine the other sort implementations stay at roughly the same figures.
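
For reference, the library call boils down to something like the following sketch (the actual test code is only linked above; the names here are mine). Note that qsort must call the comparator through a function pointer for every comparison, which is a big part of the overhead.

#include <stdlib.h>

/* Ordinary qsort comparator for ints. */
static int cmp_int(const void * a, const void * b){
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y); /* avoids the overflow risk of x - y */
}

static void sort6_libqsort(int * d){
    qsort(d, 6, sizeof *d, cmp_int);
}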

Rank order

Rex Kerr proposed a completely different method: for each item of the array, directly compute its final position. This is efficient because computing the rank order needs no branches. The drawback of this method is that it takes three times the memory of the array (one copy of the array plus variables to store the rank orders). The performance results are very surprising (and interesting). On my reference architecture with a 32-bit OS and an Intel Core 2 Quad Q8300, the cycle count was slightly below 1000 (like sorting networks with branching swap). But when compiled and executed on my 64-bit box (Intel Core 2 Duo) it performed much better: it became the fastest so far. I finally found out the true reason: my 32-bit box uses gcc 4.4.1 and my 64-bit box gcc 4.4.3, and the latter seems much better at optimizing this particular code (there was very little difference for the other proposals).

Sorting Networks 12 with reordered Swap

The amazing efficiency of Rex Kerr's proposal with gcc 4.4.3 made me wonder: how could a program that uses three times as much memory be faster than branchless sorting networks? My hypothesis was that it has fewer read-after-write dependencies, allowing for better use of the superscalar instruction scheduler of the x86. That gave me an idea: reorder the swaps to minimize read-after-write dependencies. More simply put: when you do SWAP(1, 2); SWAP(0, 2); you have to wait for the first swap to finish before performing the second one, because both access a common memory cell. When you do SWAP(1, 2); SWAP(4, 5); the processor can execute both in parallel. I tried it and it works as expected: the sorting network runs about 10% faster that way and is the fastest so far.

The code is as follows:

static inline void sort6_sorting_network_v4(int * d){
/* Branchless min/max: -(x < y) is 0 or an all-ones mask (two's complement),
 * so the xor either selects the other value or leaves it unchanged. */
#define min(x, y) (y ^ ((x ^ y) & -(x < y)))
#define max(x, y) (x ^ ((x ^ y) & -(x < y)))
/* Conditional swap without a branch: leaves d[x] <= d[y]. */
#define SWAP(x,y) { int tmp = min(d[x], d[y]); d[y] = max(d[x], d[y]); d[x] = tmp; }
    SWAP(1, 2);
    SWAP(4, 5);
    SWAP(0, 2);
    SWAP(3, 5);
    SWAP(0, 1);
    SWAP(3, 4);
    SWAP(1, 4);
    SWAP(0, 3);
    SWAP(2, 5);
    SWAP(1, 3);
    SWAP(2, 4);
    SWAP(2, 3);
#undef SWAP
#undef min
#undef max
}

If we believe our test set, the average number of cycles of the resulting code for one sort is around 70 cycles (6 tests are executed). That puts each swap at an average of 5 cycles. I call that amazingly fast. But it may yet be possible to improve on it. The conditional SWAP used above is basically a sort2, and we can rewrite the above sequence of swaps as two sort3 (on the first and last three items) followed by more plain swaps; a sketch of that sort3 building block follows. Still I haven't found any implementation of sort3(a, b, c) faster than SWAP(b, c); SWAP(a, c); SWAP(a, b); using the above implementation. But if we can find one faster than 15 cycles, it should give a better solution.
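
For reference, here is that sort3 building block written out as a function (a sketch; the function name and parameterization are mine, the swap sequence is the one given above):

/* sort3: leaves d[a] <= d[b] <= d[c] using three branchless
 * conditional swaps, the sequence named above. */
static inline void sort3(int * d, int a, int b, int c){
#define min(x, y) (y ^ ((x ^ y) & -(x < y)))
#define max(x, y) (x ^ ((x ^ y) & -(x < y)))
#define SWAP(x,y) { int tmp = min(d[x], d[y]); d[y] = max(d[x], d[y]); d[x] = tmp; }
    SWAP(b, c);
    SWAP(a, c);
    SWAP(a, b);
#undef SWAP
#undef min
#undef max
}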

+11  A: 

Here's an implementation using sorting networks:

inline void Sort2(int *p0, int *p1)
{
    const int temp = min(*p0, *p1);
    *p1 = max(*p0, *p1);
    *p0 = temp;
}

inline void Sort3(int *p0, int *p1, int *p2)
{
    Sort2(p0, p1);
    Sort2(p1, p2);
    Sort2(p0, p1);
}

inline void Sort4(int *p0, int *p1, int *p2, int *p3)
{
    Sort2(p0, p1);
    Sort2(p2, p3);
    Sort2(p0, p2);  
    Sort2(p1, p3);  
    Sort2(p1, p2);  
}

inline void Sort6(int *p0, int *p1, int *p2, int *p3, int *p4, int *p5)
{
    Sort3(p0, p1, p2);
    Sort3(p3, p4, p5);
    Sort2(p0, p3);  
    Sort2(p2, p5);  
    Sort4(p1, p2, p3, p4);  
}

You really need very efficient branchless min and max implementations for this, since that is effectively what this code boils down to - a string of min and max operations (13 of each, in total). I leave this as an exercise for the reader.
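
For completeness, one possible branchless implementation uses the same bit-twiddling trick as the macros in the question (see also the bit-hacks link below); this is a sketch assuming two's-complement integers, where -(a < b) is either 0 or an all-ones mask:

/* Branchless min/max: the mask selects between a and b without
 * a conditional jump (two's complement assumed). */
static inline int min(int a, int b){ return b ^ ((a ^ b) & -(a < b)); }
static inline int max(int a, int b){ return a ^ ((a ^ b) & -(a < b)); }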

Note that this implementation lends itself easily to vectorization (SIMD - most SIMD ISAs have vector min/max instructions) and also to GPU implementations (e.g. CUDA - being branchless there are no problems with warp divergence etc).

See also: http://stackoverflow.com/questions/2748749/fast-algorithm-implementation-to-sort-very-small-set/

Paul R
For some bit hacks for min/max: http://graphics.stanford.edu/~seander/bithacks.html#IntegerMinOrMax
Rubys
@Paul: in the real CUDA use context, it's certainly the best answer. I will check if it also is (and by how much) in the golf x64 context and publish the result.
kriss
`Sort3` would be faster (on most architectures, anyway) if you noted that `(a+b+c)-(min+max)` is the central number.
Rex Kerr
@Rex: interesting idea - it would require a widening of the data though, to prevent overflow, which would mean a performance impact in some cases (especially SIMD). It would be interesting to count the operations though: the above implementation of Sort3 is 3 `max` and 3 `min` operations for a total of 6 - how many operations do you think your method would be?
Paul R
@Paul: Overflow doesn't matter--you underflow back into range again (unless this is some weird architecture that doesn't do integer math mod 2^32). My method is 1 min, 1 max, 2 add, 2 sub--and add/sub are usually faster than min/max. If they're the same, it should be equivalent.
Rex Kerr
@Rex: I see - that looks good. For SIMD architectures like AltiVec and SSE it would be the same number of instruction cycles (max and min are single cycle instructions like add/subtract), but for a normal scalar CPU your method looks better.
Paul R
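
For illustration, Rex Kerr's trick could be sketched as follows (my own rendering, not code from the thread; the sum is done in unsigned arithmetic so the wraparound he describes is well-defined even in strict C, and finding the extremes of three values costs two min and two max operations as written):

/* Sort3 via the sum trick: the middle of three values is
 * (a+b+c) - min - max. min/max are branchless as elsewhere here. */
static inline int min3(int a, int b, int c){
    int m = b ^ ((a ^ b) & -(a < b));      /* min(a, b)             */
    return c ^ ((m ^ c) & -(m < c));       /* min(min(a, b), c)     */
}
static inline int max3(int a, int b, int c){
    int m = a ^ ((a ^ b) & -(a < b));      /* max(a, b)             */
    return m ^ ((m ^ c) & -(m < c));       /* max(max(a, b), c)     */
}

static inline void Sort3_sum(int *p0, int *p1, int *p2)
{
    int lo = min3(*p0, *p1, *p2);
    int hi = max3(*p0, *p1, *p2);
    /* Unsigned arithmetic wraps, so any overflow in the sum is
     * undone by the subtractions. */
    int mid = (int)((unsigned)*p0 + (unsigned)*p1 + (unsigned)*p2
                    - (unsigned)lo - (unsigned)hi);
    *p0 = lo; *p1 = mid; *p2 = hi;
}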
+12  A: 

For any optimization, it's always best to test, test, test. I would try at least sorting networks and insertion sort. If I were betting, I'd put my money on insertion sort based on past experience.

Do you know anything about the input data? Some algorithms will perform better with certain kinds of data. For example, insertion sort performs better on sorted or almost-sorted data, so it will be the better choice if there's an above-average chance of almost-sorted data.

The algorithm you posted is similar to an insertion sort, but it looks like you've minimized the number of swaps at the cost of more comparisons. Comparisons are far more expensive than swaps, though, because branches can cause the instruction pipeline to stall.

Here's an insertion sort implementation:

static __inline__ void sort6(int *d){
        int i, j;
        for (i = 1; i < 6; i++) {
                int tmp = d[i];
                for (j = i; j >= 1 && tmp < d[j-1]; j--)
                        d[j] = d[j-1];
                d[j] = tmp;
        }
}

Here's how I'd build a sorting network. First, use this site to generate a minimal set of SWAP macros for a network of the appropriate length. Wrapping that up in a function gives me:

static __inline__ void sort6(int * d){
#define SWAP(x,y) if (d[y] < d[x]) { int tmp = d[x]; d[x] = d[y]; d[y] = tmp; }
    SWAP(1, 2);
    SWAP(0, 2);
    SWAP(0, 1);
    SWAP(4, 5);
    SWAP(3, 5);
    SWAP(3, 4);
    SWAP(0, 3);
    SWAP(1, 4);
    SWAP(2, 5);
    SWAP(2, 4);
    SWAP(1, 3);
    SWAP(2, 3);
#undef SWAP
}
Daniel Stutzbach
+1: nice, you did it with 12 exchanges rather than the 13 in my hand-coded and empirically derived network above. I'd give you another +1 if I could for the link to the site that generates networks for you - now bookmarked.
Paul R
The macro should be `if (d[y] < d[x]) { int tmp = d[x]; d[x] = d[y]; d[y] = tmp; }`. I corrected it before trying the code.
kriss
This is a fantastic idea for a general purpose sorting function if you expect the majority of requests to be small sized arrays. Use a switch statement for the cases that you want to optimize, using this procedure; let the default case use a library sort function.
Mark Ransom
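A sketch of the dispatch Mark Ransom describes (hypothetical names; sort6 stands for the optimized routine above and cmp_int for an ordinary qsort comparator):

#include <stdlib.h>

static int cmp_int(const void * a, const void * b){
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Special-case the sizes worth optimizing; fall back to the
 * library sort for everything else. */
void sort_ints(int * d, size_t n){
    switch (n) {
    case 6:
        sort6(d);                        /* the sorting network above */
        break;
    default:
        qsort(d, n, sizeof *d, cmp_int); /* general-purpose fallback  */
        break;
    }
}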
@kriss Thanks. I fixed the macro.
Daniel Stutzbach
@Mark A *good* library sort function will already have a fast-path for small arrays. Many modern libraries will use a recursive QuickSort or MergeSort that switches to InsertionSort after recursing down to `n < SMALL_CONSTANT`.
Daniel Stutzbach
@Daniel, good point; I should have thought of that. Doesn't that imply the correct answer to the question then is to just use the library sort?
Mark Ransom
@Mark: I believe the cost you have to pay just to call the library function (instead of using a static inline) is so high it defeats the library's optimizations. But you are right, I should provide figures for a plain library call to give a reference point.
kriss
@Mark Well, a C library sort function requires that you specify the comparison operation via a function pointer. The overhead of calling a function for every comparison is huge. Usually, that's still the cleanest way to go, because this is rarely a critical path in the program. However, if it is the critical path, we really can sort much faster if we know we're sorting integers and exactly 6 of them. :)
Daniel Stutzbach
+2  A: 

Since these are integers and compares are fast, why not compute the rank order of each directly:

#include <string.h>  /* for memcpy */

inline void sort6(int *d) {
  int e[6];
  memcpy(e,d,6*sizeof(int));
  /* Rank of each element = number of elements that must precede it.
   * Earlier elements use > and later ones >=, so equal values still
   * get distinct ranks. */
  int o0 = (d[0]>d[1])+(d[0]>d[2])+(d[0]>d[3])+(d[0]>d[4])+(d[0]>d[5]);
  int o1 = (d[1]>=d[0])+(d[1]>d[2])+(d[1]>d[3])+(d[1]>d[4])+(d[1]>d[5]);
  int o2 = (d[2]>=d[0])+(d[2]>=d[1])+(d[2]>d[3])+(d[2]>d[4])+(d[2]>d[5]);
  int o3 = (d[3]>=d[0])+(d[3]>=d[1])+(d[3]>=d[2])+(d[3]>d[4])+(d[3]>d[5]);
  int o4 = (d[4]>=d[0])+(d[4]>=d[1])+(d[4]>=d[2])+(d[4]>=d[3])+(d[4]>d[5]);
  int o5 = 15-(o0+o1+o2+o3+o4); /* ranks 0..5 always sum to 15 */
  d[o0]=e[0]; d[o1]=e[1]; d[o2]=e[2]; d[o3]=e[3]; d[o4]=e[4]; d[o5]=e[5];
}
Rex Kerr
@Rex: with gcc -O1 it's below 1000 cycles, quite fast but slower than the sorting network. Any idea how to improve the code? Maybe we could avoid the array copy...
kriss
@kriss: It's faster than the sorting network for me with -O2. Is there some reason why -O2 isn't okay, or is it slower for you on -O2 also? Maybe it's a difference in machine architecture?
Rex Kerr
@Rex: O2 is indeed also the best option for me with your program. The cycle count is around 950 (I launch the program several times as the result is never perfectly stable). Thus it is faster than the first network sort implementation (the one without branchless swap) but slower than the other two. But you are right, target architecture or exact processor model can make a difference. 400 cycles is not a big difference. My testing target is an Intel Core 2 Quad Q8300 @ 2.5GHz, stepping 0a (though with this testing method frequency should be irrelevant).
kriss
@Rex: I also wonder if your method really works on every dataset. I wonder if there are cases where several values are mapped to the same place when the data contains repeated values.
kriss
@Rex: sorry, I missed the > vs >= pattern at first sight. It works in every case.
kriss
@Rex: I tried your code on my other test machine (Intel Core 2 E8400 @ 3GHz with a native 64-bit Linux) and on it your program is the fastest (~370 cycles vs ~390). I should edit my question to provide results for both architectures (with your answer).
kriss
@kriss: I think a factor of 2 difference in cycles is quite large, especially since I was testing on a 2-core machine of the same vintage as the Q8300!
Rex Kerr
@Rex: I updated my answer. The true reason was the version of the compiler (gcc 4.4.1 vs gcc 4.4.3), not the target architecture. I haven't identified exactly which optimization is responsible. Your solution seems to push gcc hard: for example, gcc 4.4.3 yields much better results with O1 than with O2. I guess I will have to look at the assembly code if I really want to understand why.
kriss
@kriss: Aha. That is not completely surprising--there are a lot of variables floating around, and they have to be carefully ordered and cached in registers and so on.
Rex Kerr