ansaurus

Question

Fastest sort of fixed length 6 int array

Answer 1

+11 A:

Here's an implementation using sorting networks:

inline void Sort2(int *p0, int *p1)
{
    const int temp = min(*p0, *p1);
    *p1 = max(*p0, *p1);
    *p0 = temp;
}

inline void Sort3(int *p0, int *p1, int *p2)
{
    Sort2(p0, p1);
    Sort2(p1, p2);
    Sort2(p0, p1);
}

inline void Sort4(int *p0, int *p1, int *p2, int *p3)
{
    Sort2(p0, p1);
    Sort2(p2, p3);
    Sort2(p0, p2);  
    Sort2(p1, p3);  
    Sort2(p1, p2);  
}

inline void Sort6(int *p0, int *p1, int *p2, int *p3, int *p4, int *p5)
{
    Sort3(p0, p1, p2);
    Sort3(p3, p4, p5);
    Sort2(p0, p3);  
    Sort2(p2, p5);  
    Sort4(p1, p2, p3, p4);  
}

You really need very efficient branchless min and max implementations for this, since that is effectively what this code boils down to - a string of min and max operations (13 of each, in total). I leave this as an exercise for the reader.

Note that this implementation lends itself easily to vectorization (SIMD - most SIMD ISAs have vector min/max instructions) and also to GPU implementations (e.g. CUDA - being branchless there are no problems with warp divergence etc).

See also: http://stackoverflow.com/questions/2748749/fast-algorithm-implementation-to-sort-very-small-set/

Paul R 2010-05-07 07:37:30

For some bit hacks for min/max: http://graphics.stanford.edu/~seander/bithacks.html#IntegerMinOrMax

Rubys 2010-05-07 08:20:15
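One possible take on the branchless min and max left as an exercise above, using the XOR-and-mask trick from the bit hacks page. The helper names `branchless_min` and `branchless_max` are made up for this sketch, and it assumes a two's complement `int`. Whether it actually beats what your compiler generates for a plain ternary is something to measure rather than assume.

```c
/* Hypothetical helpers, not part of the original answer.
 * -(x < y) is all ones when x < y and zero otherwise, so the mask
 * either keeps or discards the (x ^ y) term. */
static inline int branchless_min(int x, int y)
{
    return y ^ ((x ^ y) & -(x < y));
}

static inline int branchless_max(int x, int y)
{
    return x ^ ((x ^ y) & -(x < y));
}

/* Same shape as the Sort2 in the answer, built on the helpers above. */
static inline void Sort2(int *p0, int *p1)
{
    const int lo = branchless_min(*p0, *p1);
    *p1 = branchless_max(*p0, *p1);
    *p0 = lo;
}
```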

@Paul: in the real CUDA use context, it's certainly the best answer. I will check if it also is (and how much) in golf x64 context and publish result.

kriss 2010-05-07 12:06:57

`Sort3` would be faster (on most architectures, anyway) if you noted that `(a+b+c)-(min+max)` is the central number.

Rex Kerr 2010-05-07 23:35:41

@Rex: interesting idea - it would require a widening of the data though, to prevent overflow, which would mean a performance impact in some cases (especially SIMD). It would be interesting to count the operations though: the above implementation of Sort3 is 3 `max` and 3 `min` operations for a total of 6 - how many operations do you think your method would need?

Paul R 2010-05-08 07:16:59

@Paul: Overflow doesn't matter--you underflow back into range again (unless this is some weird architecture that doesn't do integer math mod 2^32). My method is 1 min, 1 max, 2 add, 2 sub--and add/sub are usually faster than min/max. If they're the same, it should be equivalent.

Rex Kerr 2010-05-08 12:55:06

@Rex: I see - that looks good. For SIMD architectures like AltiVec and SSE it would be the same number of instruction cycles (max and min are single cycle instructions like add/subtract), but for a normal scalar CPU your method looks better.

Paul R 2010-05-08 16:13:18

Answer 2

+12 A:

For any optimization, it's always best to test, test, test. I would try at least sorting networks and insertion sort. If I were betting, I'd put my money on insertion sort based on past experience.

Do you know anything about the input data? Some algorithms will perform better with certain kinds of data. For example, insertion sort performs better on sorted or almost-sorted data, so it will be the better choice if there's an above-average chance of almost-sorted data.

The algorithm you posted is similar to an insertion sort, but it looks like you've minimized the number of swaps at the cost of more comparisons. Comparisons can be far more expensive than swaps, though, because each one is a branch, and a mispredicted branch stalls the instruction pipeline.

Here's an insertion sort implementation:

static __inline__ void sort6(int *d){
        int i, j;
        for (i = 1; i < 6; i++) {
                int tmp = d[i];
                for (j = i; j >= 1 && tmp < d[j-1]; j--)
                        d[j] = d[j-1];
                d[j] = tmp;
        }
}

Here's how I'd build a sorting network. First, use this site to generate a minimal set of SWAP macros for a network of the appropriate length. Wrapping that up in a function gives me:

static __inline__ void sort6(int * d){
#define SWAP(x,y) if (d[y] < d[x]) { int tmp = d[x]; d[x] = d[y]; d[y] = tmp; }
    SWAP(1, 2);
    SWAP(0, 2);
    SWAP(0, 1);
    SWAP(4, 5);
    SWAP(3, 5);
    SWAP(3, 4);
    SWAP(0, 3);
    SWAP(1, 4);
    SWAP(2, 5);
    SWAP(2, 4);
    SWAP(1, 3);
    SWAP(2, 3);
#undef SWAP
}

Daniel Stutzbach 2010-05-07 15:02:00
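If you hand-edit a network like the one above, a cheap sanity check is the zero-one principle: a compare-exchange network sorts every input if and only if it sorts every input made of 0s and 1s, so for 6 elements only 64 cases need checking. The sketch below copies the 12 SWAPs from the answer into a function named `network_sort6` (a name chosen here) and tests it with a hypothetical helper `sorts_all_zero_one_inputs`.

```c
#include <stdbool.h>

/* The 12-exchange network from the answer, renamed for this sketch. */
static void network_sort6(int *d)
{
#define SWAP(x,y) if (d[y] < d[x]) { int tmp = d[x]; d[x] = d[y]; d[y] = tmp; }
    SWAP(1, 2); SWAP(0, 2); SWAP(0, 1);
    SWAP(4, 5); SWAP(3, 5); SWAP(3, 4);
    SWAP(0, 3); SWAP(1, 4); SWAP(2, 5);
    SWAP(2, 4); SWAP(1, 3); SWAP(2, 3);
#undef SWAP
}

/* Hypothetical checker: feeds all 2^6 zero/one inputs to sort6 and
 * returns false as soon as one output is not in ascending order. */
static bool sorts_all_zero_one_inputs(void (*sort6)(int *))
{
    for (unsigned bits = 0; bits < 64; bits++) {
        int d[6];
        for (int i = 0; i < 6; i++)
            d[i] = (int)((bits >> i) & 1u);
        sort6(d);
        for (int i = 1; i < 6; i++)
            if (d[i - 1] > d[i])
                return false;
    }
    return true;
}
```

This only applies to code built purely from compare-exchange steps; it says nothing about an insertion sort or a rank-based approach.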

+1: nice, you did it with 12 exchanges rather than the 13 in my hand-coded and empirically derived network above. I'd give you another +1 if I could for the link to the site that generates networks for you - now bookmarked.

Paul R 2010-05-07 20:52:46

The macro should be `if (d[y] < d[x]) { int tmp = d[x]; d[x] = d[y]; d[y] = tmp; }` I corrected it before trying code.

kriss 2010-05-07 21:36:54

This is a fantastic idea for a general purpose sorting function if you expect the majority of requests to be small sized arrays. Use a switch statement for the cases that you want to optimize, using this procedure; let the default case use a library sort function.

Mark Ransom 2010-05-07 21:47:31
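A minimal sketch of the switch idea from the comment above, assuming only sizes 2 and 3 get a hand-written network and everything else falls through to `qsort`. The names `sort_ints`, `cswap` and `compare_ints` are invented for illustration.

```c
#include <stddef.h>
#include <stdlib.h>

/* Comparator for qsort; avoids the overflow-prone "return x - y". */
static int compare_ints(const void *a, const void *b)
{
    const int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Compare-exchange: afterwards *a <= *b. */
static void cswap(int *a, int *b)
{
    if (*b < *a) { int tmp = *a; *a = *b; *b = tmp; }
}

/* Hypothetical dispatcher: small sizes use fixed networks,
 * the default case uses the library sort. */
static void sort_ints(int *d, size_t n)
{
    switch (n) {
    case 0:
    case 1:
        return;
    case 2:
        cswap(&d[0], &d[1]);
        return;
    case 3:
        cswap(&d[1], &d[2]);
        cswap(&d[0], &d[2]);
        cswap(&d[0], &d[1]);
        return;
    default:
        qsort(d, n, sizeof d[0], compare_ints);
        return;
    }
}
```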

@kriss Thanks. I fixed the macro.

Daniel Stutzbach 2010-05-07 22:11:39

@Mark A *good* library sort function will already have a fast-path for small arrays. Many modern libraries will use a recursive QuickSort or MergeSort that switches to InsertionSort after recursing down to `n < SMALL_CONSTANT`.

Daniel Stutzbach 2010-05-07 22:16:16
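For illustration, here is roughly what such a hybrid looks like: a recursive quicksort that hands off to insertion sort once a partition drops below a cutoff. This is a simplified sketch, not how any particular library does it; the cutoff value 16 and the names `hybrid_sort` and `insertion_sort` are assumptions.

```c
enum { SMALL_CONSTANT = 16 };  /* cutoff picked arbitrarily for the sketch */

static void insertion_sort(int *d, long n)
{
    for (long i = 1; i < n; i++) {
        const int tmp = d[i];
        long j = i;
        for (; j >= 1 && tmp < d[j - 1]; j--)
            d[j] = d[j - 1];
        d[j] = tmp;
    }
}

static void hybrid_sort(int *d, long n)
{
    if (n < SMALL_CONSTANT) {
        insertion_sort(d, n);      /* fast path for small partitions */
        return;
    }

    /* Hoare-style partition around the middle element. */
    const int pivot = d[n / 2];
    long i = 0, j = n - 1;
    while (i <= j) {
        while (d[i] < pivot) i++;
        while (d[j] > pivot) j--;
        if (i <= j) {
            const int tmp = d[i]; d[i] = d[j]; d[j] = tmp;
            i++;
            j--;
        }
    }

    hybrid_sort(d, j + 1);         /* elements d[0..j]   */
    hybrid_sort(d + i, n - i);     /* elements d[i..n-1] */
}
```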

@Daniel, good point; I should have thought of that. Doesn't that imply the correct answer to the question then is to just use the library sort?

Mark Ransom 2010-05-07 22:25:49

@Mark: I believe the cost you have to pay just to call the library function (instead of static inline) is so high it defeats the library optimizations. But you are right, I should provide figures for plain library call to give a reference point.

kriss 2010-05-07 22:47:52

@Mark Well, a C library sort function requires that you specify the comparison operation via a function pointer. The overhead of calling a function for every comparison is huge. Usually, that's still the cleanest way to go, because this is rarely a critical path in the program. However, if it is the critical path, we really can sort much faster if we know we're sorting integers and exactly 6 of them. :)

Daniel Stutzbach 2010-05-07 23:08:05

Answer 3

+2 A:

Since these are integers and compares are fast, why not compute the rank order of each directly:

#include <string.h>  /* memcpy */

inline void sort6(int *d) {
  int e[6];
  memcpy(e,d,6*sizeof(int));
  int o0 = (d[0]>d[1])+(d[0]>d[2])+(d[0]>d[3])+(d[0]>d[4])+(d[0]>d[5]);
  int o1 = (d[1]>=d[0])+(d[1]>d[2])+(d[1]>d[3])+(d[1]>d[4])+(d[1]>d[5]);
  int o2 = (d[2]>=d[0])+(d[2]>=d[1])+(d[2]>d[3])+(d[2]>d[4])+(d[2]>d[5]);
  int o3 = (d[3]>=d[0])+(d[3]>=d[1])+(d[3]>=d[2])+(d[3]>d[4])+(d[3]>d[5]);
  int o4 = (d[4]>=d[0])+(d[4]>=d[1])+(d[4]>=d[2])+(d[4]>=d[3])+(d[4]>d[5]);
  int o5 = 15-(o0+o1+o2+o3+o4);
  d[o0]=e[0]; d[o1]=e[1]; d[o2]=e[2]; d[o3]=e[3]; d[o4]=e[4]; d[o5]=e[5];
}

Rex Kerr 2010-05-07 23:19:00

@Rex: with gcc -O1 it's below 1000 cycles, quite fast but slower than sorting network. Any idea to improve code ? Maybe if we could avoid array copy...

kriss 2010-05-07 23:47:59

@kriss: It's faster than the sorting network for me with -O2. Is there some reason why -O2 isn't okay, or is it slower for you on -O2 also? Maybe it's a difference in machine architecture?

Rex Kerr 2010-05-08 01:22:47

@Rex: O2 is indeed also the best option for me with your program. Cycle count is around 950 (I launch the program several times as the result is never perfectly stable). Thus it is faster than the first Network Sort implementation (the one without branchless swap) but slower than the other two. But you are right, target architecture or exact processor model can make a difference. 400 cycles is not a big difference. My testing target is an Intel Core2 Quad Q8300, stepping 0a (though with this testing method frequency should be irrelevant).

kriss 2010-05-08 06:22:04

@Rex: I also wonder if your method is really working on every dataset. I wonder if you do not have cases where several values are mapped to the same place when sorted data are repeated.

kriss 2010-05-08 06:34:42

@Rex: sorry, I missed the > vs >= pattern at first sight. It works in every case.

kriss 2010-05-08 06:41:49
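To see why the `>` vs `>=` pattern copes with repeated values, here is a loop version of the same rank computation (the name `rank_sort6` is made up here). Each element counts earlier elements with `>=` and later ones with `>`, so equal values get consecutive ranks in their original order and no two elements ever land in the same slot.

```c
#include <string.h>

/* Loop form of the rank-order sort from the answer; hypothetical name. */
static void rank_sort6(int *d)
{
    int e[6];
    int rank[6];
    memcpy(e, d, sizeof e);

    for (int i = 0; i < 6; i++) {
        rank[i] = 0;
        for (int j = 0; j < i; j++)
            rank[i] += (e[i] >= e[j]);   /* ties with earlier elements count */
        for (int j = i + 1; j < 6; j++)
            rank[i] += (e[i] > e[j]);    /* ties with later elements do not */
    }

    for (int i = 0; i < 6; i++)
        d[rank[i]] = e[i];
}
```

For the input `{2, 1, 2, 1, 2, 1}` the ranks come out as 3, 0, 4, 1, 5, 2, which is a permutation of 0..5.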

@Rex: I tried your code on my other test machine (Intel Core 2 E8400 @ 3GHz with native Linux 64bits OS) and on it your program is the fastest (~370 cycles vs ~390). I should edit my question to provide results for both architectures (with your answer).

kriss 2010-05-08 07:00:03

@kriss: I think a factor of 2 difference in cycles is quite large, especially since I was testing on a 2-core machine of the same vintage as the Q8300!

Rex Kerr 2010-05-08 18:32:54

@Rex: I updated my answer. The true reason was the compiler version (gcc441 vs gcc443), not the target architecture. I didn't identify exactly which optimization is responsible. Your solution seems to push gcc hard. For example, gcc443 yields much better results with O1 than with O2. I guess I will have to look at the assembly code if I really want to understand why.

kriss 2010-05-09 21:15:19

@kriss: Aha. That is not completely surprising--there are a lot of variables floating around, and they have to be carefully ordered and cached in registers and so on.

Rex Kerr 2010-05-09 22:32:17
