views:

177

answers:

7

If you're optimizing for an architecture on which branching is expensive (say the PS3's cell processor), it can be important to be able to determine whether or not you can express a given algorithm without using branches or at least using fewer branches. One pattern that I see a lot in unoptimized code is a bunch of if's used to tweak an index into some array (if the size of the array is odd, bump the index by 1, under some other circumstance, multiply by 2, etc.). So it would be nice if there were a way, given two lists of numbers, to be able to determine whether or not it's possible to write a branchless function transforming one list into the other.

For example I recently wanted to know if it was possible to write a branchless function that would transform: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 to: 0, 2, 4, 6, 8, 9, 7, 5, 3, 1 (ascending even followed by descending odd). Technically, I could write a big switch/case function but obviously I'm interested in a function that will follow the pattern for arbitrary sizes. Writing a function to do this transformation is straightforward with branching, but if there's a nonbranching way to do it it's not immediately obvious.

So is there a general approach to this sort of problem, or any sort of quick litmus test? Or do you have to come up with proofs on a case by case basis? I could work harder at these sorts of problems but that's pointless if they're literally impossible. I seem to recall reading at some point that there's a formal mathematical word for functions that only use arithmetic without branching, but I can't recall.

A: 

If speed is really of the essence, couldn't you write out the instructions for lists upto a certain length? (One could pre-generate this code of course).

so:

 void algorithm1_Length6(int *srcList, int *destList)
 {
      *destList++ = *srcList;
      *destList++ = srcList[2];
      *destList++ = srcList[4];
      *destList++ = srcList[5];
      *destList++ = srcList[3];
      *destList++ = srcList[1];
 }

and all the other variations upto a certain length.

Toad
"Technically, I could write a big switch/case function but obviously I'm interested in a function that will follow the pattern for arbitrary sizes."
Joseph Garvin
A: 

For a given array you can use method like this:

 void tranform(int[] src, int[] dest) {
        //0, 2, 4, 6, 8, 9, 7, 5, 3, 1
        dest[0] = src[0];
        dest[1] = src[2];
        dest[2] = src[4];
        dest[3] = src[6];
        dest[4] = src[8];
        dest[5] = src[9];
        dest[6] = src[7];
        dest[7] = src[5];
        dest[8] = src[3];
        dest[9] = src[1];
    }

But in general for big arrays it's hard to write such methods, therefore it will be useful if you write generator method like this:

static void createFunction(int[] src, int[] dest) {
        System.out.println("void tranform(int[] src, int[] dest) {");
        for (int i = 0; i < dest.length; i++) {
            for (int j = 0; j < src.length; j++) {
                if (dest[i] == src[j]) {
                    System.out.println("dest[" + i + "]=src[" + j + "];");
                    break;
                }
            }
        }
        System.out.println("}");
    }

invoke it with your arrays: createFunction(new int[]{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, new int[]{0, 2, 4, 6, 8, 9, 7, 5, 3, 1});

And paste output of this method to your program.

rachvela
I specifically mentioned I was looking for functions that could handle input of varying size. Making a function that can handle varying size that generates functions that can't is a creative workaround but not what I'm looking for. Even ignoring it's hackishness, if optimization is a goal this would definitely not be a good solution.
Joseph Garvin
+1  A: 

If you're optimizing for the PS3 in particular, the Power PC Compiler Writers Guide has techniques on branchfree code in Section 3.1.5 and it has GNU Superoptimizer sequences for branchfree code in Appendix D.

You may be interested in Mike Acton's Cell Performance Blog as well.

Adisak
FWIW, there should be GNU Superoptimizer sequences for X86 as well for many common cases if you look for them.
Adisak
+1  A: 

If you plot your desired indexes against your input indexes, you get a triangular shaped function. It turns out that for your n=10 case, it is

9.5 - abs(2 (x - 4.75))

Therefore, for general n, it would be

n-0.5 - abs(2*(x - n/2-0.25))

Or in integer form,

(2*n-1 - abs(4*x - 2*n + 1)) / 2

This is completely branchless in that your output indexes are generated with a single mathematecal function. I think the general approach would be to plot out the desired indexes and look for a pattern and a way of representing it with math functions.

Obviously if your desired final indexes form a straight line, then the transformation is simple. If you have a kink in the mapping, then you want to use the absolute value function to introduce a bend, and you can tune the scaling to change the angle of the bend. You can tilt the kink by biasing it (e.g. abs(x)+x/2). If you need a jump discontinuity in your final index function, then use the sign function (hopefully builtin or use abs(x)/x). You need to be creative with how to use the graphs of common functions to your advantage here.


Addendum

If your indexing function is piecewise linear, there is a straightforward algorithm. Assume the desired index function is expressed as a list of segments

{(sx1,sy1)-(ex1,ey1), (sx2,sy2)-(ex2,ey2), ... , (sxN,syN)-(exN,eyN)}
 segment 1            segment 2                   segment N

where exK > sxK for all K and sxK > sx(K-1) for all K (put them from left to right).

k = 1
f(x) = Make affine model of segment k
g(x) = f(x)
Do:
   k = k + 1
   h(x) = Makeaffine model of segment k
   If g(x) and h(x) intersect between ex(k-1) and ex(k)
       f(x) = f(x) + [slope difference of g(x) and h(x)] * ramp(x)
   Else
       f(x) = f(x) + (h(ex(k-1)) - f(ex(k-1))) * step(x)
       f(x) = f(x) + [slope difference of g(x) and h(x)] * ramp(x)

where ramp(x) = (abs(x)+x)/2 and step(x) = (sign(x)+1)/2. f(x) is meant to represent the desired function, g(x) is the last segment's affine model, and h(x) is the current segment's affine model. An affine model is just a line in slope-offset form: a*x+b, and the slope difference is the difference of slopes. This algorithm simply proceeds from left right, adding in the proper pieces of functions as it goes along. The functions it adds on are always zero for x <= 0 so they don't affect the f(x) that has been built up so far.

Of course, there may be some bugs/typos in the above. I really have to get to a meeting, so I can't write any more.

Victor Liu
Can you calculate abs() without a branch?
Drew Hall
@Drew Hall: That is almost certainly done in hardware.
Amuck
Abs is not always in hardware for integers but `fabs` is a common FPU operation. For int's you can do branchless `int Abs(int A) { int Sign = A >> 31; return ( A ^ Sign ) - Sign; }`
Adisak
FWIW, converting from INT to FLOAT and back is going to be very slow on PS3 due to LHS (Load Hit Store) Penalty. This would be much slower than a branching version on the target platform for the Question.
Adisak
The piecewise linear addendum is definitely not going to solve the branchfree requirements of the Question if you do it in a loop. To do that branchfree, you have to compute all values simultaneously and then do a branchfree select. You should look to collapse common expressions of the computation values as well when doing branchfree optimization.
Adisak
The point of the algorithm is to generate code, not to be used literally AS the code.
Victor Liu
+3  A: 

Transform: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 to: 0, 2, 4, 6, 8, 9, 7, 5, 3, 1 (ascending even followed by descending odd).

Simple: given the sequence of N values of X from 0 to N-1, we see the first half of the sequence is 2X. The second half of the sequence is (2N-1)-2X. The sequence splits at X=(N+1)/2 with "integer" math. In the above example, N == 10.

So assuming 32-bit signed ints with arithmetic shift right:

int Transform(int x)
{
    const int seq1=x+x;
    const int mask=(x-((N+1)>>1))>>31;
    const int seq2=(N+N-1)-seq1;
    return (mask&seq1)|((~mask)&seq2);
}

Note that the mask pattern used here is fast because PowerPC has an ANDC (and with complement) that makes (~mask) a free operation.

Adisak
Could you explain _how_ you came up with that? I'm more looking for insight into general ways of handling these problems then just getting handed an off the shelf solution for one instance, though I appreciate it just the same :)
Joseph Garvin
I noted there are two sequences... odd and even. I derived the integer formula for each and then figured a mask condition. Any time you need to pick between two conditions and there is a nicely defined boundary, it's fairly easy to make a branchless representation.
Adisak
It's when you start having more than two conditions to chose from that determining the boundary conditions and the mask variables becomes difficult.
Adisak
+1  A: 

You can always write a polynomial formula using Lagrange interpolation for instance. Not pretty (or particularly fast) but it won't have any branches.

lhf
A: 

Technically any series of operations can be executed without "branching" using a state machine that utilizes boolean operations. The concept of branching is because most programs are a series of instructions executed by a program counter that can go one way or the other.

Even if you're talking about a pure functional approach which is stateless, for a finite set of discrete values, you can always (at the cost of large amounts of memory) use a lookup table.

Jason S