Sometimes a loop in which the CPU spends most of the time has a branch that mispredicts very often (near .5 probability). I've seen a few techniques discussed in very isolated threads, but never a list. The ones I know already fix situations where the condition can be turned into a bool and that 0/1 value is used in some way to avoid the branch. Are there other conditional branches that can be avoided?

e.g. (pseudocode)

loop () {
  if (in[i] < C)
    out[o++] = in[i++]
  ...
}

Can be rewritten, arguably losing some readability, with something like this:

loop() {
  out[o] = in[i]  // copy anyway, just don't increment
  inc = in[i] < C  // increment counters? (0 or 1)
  o += inc
  i += inc
}
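
For reference, here is a compilable C sketch of the same idea (the name filter_less and the always-advancing input index are invented for illustration):

#include <stddef.h>

/* Copy every in[i] < limit into out[]; returns the number kept.
   out must have room for n elements: out[o] is written on every
   iteration, but o only advances when the condition holds, so the
   loop body contains no data-dependent branch. */
size_t filter_less(const int *in, size_t n, int *out, int limit)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        out[o] = in[i];          /* copy unconditionally */
        o += (in[i] < limit);    /* 0 or 1: advance only on a match */
    }
    return o;
}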

Also, I've seen techniques in the wild that change && to & in conditionals, in certain contexts escaping my mind right now. I'm a rookie at this level of optimization, but it sure feels like there's got to be more.

+3  A: 

The generalization of the example you give is "replace conditional evaluation with math"; conditional-branch avoidance largely boils down to that.

What's going on with replacing && with & is that, since && is short-circuit, it constitutes conditional evaluation in and of itself. & gets you the same logical results if both sides are either 0 or 1, and isn't short-circuit. Same applies to || and | except you don't need to make sure the sides are constrained to 0 or 1 (again, for logic purposes only, i.e. you're using the result only Booleanly).
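
For instance (a made-up snippet; both comparisons already produce 0 or 1, so & preserves the logic while evaluating both sides unconditionally):

/* Hypothetical illustration: count indices where both values are nonzero. */
unsigned count_both_nonzero(const int *x, const int *y, unsigned n)
{
    unsigned count = 0;
    for (unsigned i = 0; i < n; i++) {
        /* With &&, the compiler may emit a second branch, because y[i]
           must not be evaluated when x[i] is zero.  With &, both
           comparisons always run and combine without a jump. */
        count += (x[i] != 0) & (y[i] != 0);
    }
    return count;
}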

chaos
+2  A: 

GCC is already smart enough to replace conditionals with simpler instructions. For example, newer Intel processors provide cmov (conditional move). If you can use SSE2, it provides some instructions to compare 4 integers (or 8 shorts, or 16 chars) at a time.
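
For instance, the packed compare can be combined with masking for a branchless four-wide minimum (a sketch using standard SSE2 intrinsics; SSE2 itself has no packed 32-bit min instruction):

#include <emmintrin.h>   /* SSE2 intrinsics */

/* Branchless minimum of four signed 32-bit lanes at once. */
static __m128i min4_epi32(__m128i a, __m128i b)
{
    __m128i lt = _mm_cmplt_epi32(a, b);           /* all-ones where a < b */
    return _mm_or_si128(_mm_and_si128(lt, a),     /* keep a where a < b   */
                        _mm_andnot_si128(lt, b)); /* keep b elsewhere     */
}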

Additionally, to compute the minimum you can use (see these magic tricks):

min(x, y) = x+(((y-x)>>(WORDBITS-1))&(y-x))
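
In C that might look like this (a sketch: it assumes two's-complement ints, an arithmetic right shift, and that y - x doesn't overflow):

#include <limits.h>

static inline int branchless_min(int x, int y)
{
    /* (y - x) >> (WORDBITS - 1) is all-ones when y < x, zero otherwise,
       so the & term adds either (y - x) or nothing. */
    return x + (((y - x) >> (sizeof(int) * CHAR_BIT - 1)) & (y - x));
}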

However, pay attention to things like:

c[i][j] = min(c[i][j], c[i][k] + c[j][k]);   // from the Floyd-Warshall algorithm

which, even though no jumps are implied, is much slower than

int tmp = c[i][k] + c[j][k];
if (tmp < c[i][j])
    c[i][j] = tmp;

My best guess is that in the first snippet you pollute the cache more often: the min() form stores to c[i][j] unconditionally, while in the second the store happens only when the value actually changes.

Alexandru
Note that `cmov` has the disadvantage of being considered dependent on its source operand, from the point of view of instruction reordering and parallel execution. For a condition that is often false, a well-predicted conditional jump may be faster than a stalling `cmov`.
Pascal Cuoq
+1  A: 

In my opinion if you're reaching down to this level of optimization, it's probably time to drop right into assembly language.

Essentially you're counting on the compiler generating a specific pattern of assembly to take advantage of this optimization in C anyway. It's difficult to guess exactly what code a compiler is going to generate, so you'd have to look at it anytime a small change is made - why not just do it in assembly and be done with it?

Michael Burr
True, that's why the assembly tag. If you have assembly techniques for this kind of optimization, it would be much appreciated if you could share them (links too!)
alecco
I'm not sure there's much I can share - my assembly is mostly on the reading side (when debugging) or doing hardware-level stuff that can't be done in C (not optimization) on embedded systems. One thing that pops into my head is ARM-specific, and not much of a trick. ARM instructions have a field that allows them to be executed conditionally, so instead of having to jump around them they effectively become NOPs, with no effect on the instruction pipeline.
Michael Burr
+1  A: 

This level of optimization is unlikely to make a worthwhile difference in all but the hottest of hotspots. Assuming it does (without proving it in a specific case) is a form of guessing, and the first rule of optimization is don't act on guesses.

Mike Dunlavey
I think the example in the question is quite real and far from guessing. In fact it's right there in this code. This is of course for the innermost components of tight loops for compressing/sorting/searching, so it is definitely a hotspot. It's not optimizing hello-world just for kicks. Thanks.
alecco
@aleccolocco: Here's what I mean. Pick a real program, not one that's created just to ask a question. Do some performance tuning on it, to really wring it out. Issues like branch-prediction don't come in until everything else is exhausted, so starting with the assumption that they really matter is not based on knowing what the problems actually are. http://stackoverflow.com/questions/926266/performance-optimization-strategies-of-last-resort/927773#927773
Mike Dunlavey
... at the same time, when you do get down to hotspots like that, you are right, they can make a difference. (I'm sorry. To me it's a hot-button issue that many people seem to think optimization starts and ends at the low level, when that's only the tip of the iceberg.)
Mike Dunlavey
@MikeDunlavey Yes, indeed. Also there are more obscure performance penalties like page splits or cache-line splits. But I know how to handle those already (and preventive measures are already in the design). Cheers.
alecco
+1  A: 

At this level things are very hardware-dependent and compiler-dependent. Is the compiler you're using smart enough to compile < without control flow? gcc on x86 is smart enough; lcc is not. On older or embedded instruction sets it may not be possible to compute < without control flow.

Beyond this Cassandra-like warning, it's hard to make any helpful general statements. So here are some general statements that may be unhelpful:

  • Modern branch-prediction hardware is terrifyingly good. If you could find a real program where bad branch prediction costs more than 1%-2% slowdown, I'd be very surprised.

  • Performance counters or other tools that tell you where to find branch mispredictions are indispensable.

  • If you actually need to improve such code, I'd look into trace scheduling and loop unrolling:

    • Loop unrolling replicates loop bodies and gives your optimizer more control flow to work with (see the sketch after this list).

    • Trace scheduling identifies which paths are most likely to be taken, and among other tricks, it can tweak branch directions so that the branch-prediction hardware works better on the most common paths. With unrolled loops, there are more and longer paths, so the trace scheduler has more to work with.

  • I'd be leery of trying to code this myself in assembly. When the next chip comes out with new branch-prediction hardware, chances are excellent that all your hard work goes down the drain. Instead I'd look for a feedback-directed optimizing compiler.
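
As a sketch of the unrolling point above (the function and the factor of four are invented for illustration; compilers can often do this themselves with flags like -funroll-loops or profile feedback):

#include <stddef.h>

/* Sum an array with a 4x-unrolled loop: one loop branch per four
   elements, and longer basic blocks for the optimizer to schedule. */
long sum_unrolled(const int *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* remainder */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}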

Norman Ramsey
Cool, thanks! I'm doing SIMD compression, sorting, and searching on large data sets. It makes a difference when the probability is about .5 (that's why that's in the question at the beginning). Well, save Itanium or architectures like that, but that's not my case. The nature of the data will vary significantly, as it's not specialized for one kind of dataset (it could be random, incremental, etc.), so feedback will help, but only up to a point. And there are many cases like the example in the question that can be easily solved without even diving into assembly. That's my quest :)
alecco
+1  A: 

Most processors provide branch prediction that is better than 50%. In fact, if you get a 1% improvement in branch prediction then you can probably publish a paper. There is a mountain of papers on this topic if you are interested.

You're better off worrying about cache hits and misses.

BobbyShaftoe
I've found that--at least in some cases--the solution to branch prediction misses is often also better for cache performance. It can be a win-win.
Adrian McCarthy
+4  A: 

I believe the most common way to avoid branching is to leverage bit parallelism to reduce the total number of jumps present in your code. The longer the basic blocks, the less often the pipeline is flushed.

As someone else has mentioned, if you want to do more than unrolling loops and providing branch hints, you're going to want to drop into assembly. Of course, this should be done with utmost caution: your typical compiler can write better assembly in most cases than a human. Your best hope is to shave off rough edges and make assumptions that the compiler cannot deduce.

Here's an example, starting from the following C code:

if (b > a) b = a;

In assembly without any jumps, by using bit-manipulation (and extreme commenting):

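; assumes eax = a and ebx = b (unsigned values); edx is scratch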
sub eax, ebx ; = a - b
sbb edx, edx ; = (b > a) ? 0xFFFFFFFF : 0
and edx, eax ; = (b > a) ? a - b : 0
add ebx, edx ; b = (b > a) ? b + (a - b) : b + 0

Note that while conditional moves are immediately jumped on by assembly enthusiasts, that's only because they're easily understood and provide a higher-level language concept in a convenient single instruction. They are not necessarily faster, are not available on older processors, and by mapping your C code into corresponding conditional-move instructions you're just doing the work of the compiler.

Matt Joiner
+1  A: 

An extension of the technique demonstrated in the original question applies when you have to do several nested tests to get an answer. You can build a small bitmask from the results of all the tests, and then "look up" the answer in a table.

if (a) {
  if (b) {
    result = q;
  } else {
    result = r;
  }
} else {
  if (b) {
    result = s;
  } else {
    result = t;
  }
}

If a and b are nearly random (e.g., from arbitrary data) and this is in a tight loop, then branch-prediction failures can really slow this down. It can be rewritten as:

static const int table[] = { t, s, r, q };
unsigned index = (a << 1) | b;
result = table[index];

You can generalize this to several conditionals. I've seen it done for 4. If the nesting gets that deep, though, you want to make sure that testing all of them is really faster than doing just the minimal tests suggested by short-circuit evaluation.
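
For example, a hypothetical three-condition version (v0 through v7 are invented names for the eight outcomes; a, b, and c must each be exactly 0 or 1, which !!x guarantees):

static const int table8[] = { v0, v1, v2, v3, v4, v5, v6, v7 };  /* outcomes indexed by the 3-bit mask */
unsigned index = (a << 2) | (b << 1) | c;
result = table8[index];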

Adrian McCarthy