Just to make it clear, I'm not going for any sort of portability here, so any solution that ties me to a certain box is fine.

Basically, I have an if statement that will evaluate to true 99% of the time, and I'm trying to eke out every last clock of performance. Can I issue some sort of compiler command (using GCC 4.1.2 and the x86 ISA, if it matters) to tell the branch predictor which way that branch will almost always go?

+12  A: 

Yes. http://kerneltrap.org/node/4705

__builtin_expect is a method that gcc (versions >= 2.96) offers for programmers to indicate branch prediction information to the compiler. The return value of __builtin_expect is the first argument (which can only be an integer) passed to it.

if (__builtin_expect (x, 0))
    foo ();

     [This] would indicate that we do not expect to call `foo', since we
     expect `x' to be zero.
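
For readability, many codebases (the Linux kernel among them, as the kerneltrap link above shows) wrap the builtin in likely/unlikely macros. A minimal sketch, assuming GCC; process() and handle_error() are just placeholder functions:

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* hypothetical placeholders for the hot and cold code paths */
void process(void);
void handle_error(void);

void handle(int ok)
{
    if (likely(ok)) {
        process();        /* the 99% case: kept on the fall-through path */
    } else {
        handle_error();   /* the rare case: moved off the hot path */
    }
}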
Drakosha
Perfect, thanks!
Andy Shulman
In Microsoft environments, if statements are predicted to always be true. Some versions do have Profile Guided Optimisation.
Charles Beattie
A: 

No, because there's no assembly command to let the branch predictor know. Don't worry about it, the branch predictor is pretty smart.

Also, obligatory comment about premature optimization and how it's evil.

EDIT: Drakosha mentioned some macros for GCC. However, I believe this is a code optimization and actually has nothing to do with branch prediction.

rlbond
Thank you Mr. Knuth. If this weren't a competition to see whose solution ran the absolute fastest, I would completely agree.
Andy Shulman
If you need every single cycle, why not just use inline assembly?
rlbond
The full quote: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. **A good programmer will not be lulled into complacency by such reasoning**, he will be wise to look carefully at the critical code; but only after that code has been identified." (emphasis mine)
Roger Pate
Required to use C! I just love inane requirements. However, I would probably go insane. Writing a highly optimized malloc/free/realloc is difficult enough in C, but in assembly? Heh...
Andy Shulman
The branch predictor has a static rule when it knows nothing about a branch: take backwards branches, don't take forwards branches. If you think about how a for loop works you'll understand why that makes sense, since you jump back to the top of the loop many more times than you don't. So what the GCC macro is controlling is how GCC lays out the opcodes in memory, so that the forward / backward branch prediction rule is most effective.
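To make that intuition concrete (a trivial sketch, nothing project-specific, illustrating the old static rule only), the conditional jump back to the top of a loop is taken on every iteration and falls through just once:

void work(int i);   /* hypothetical loop body */

void run(int n)
{
    int i;
    for (i = 0; i < n; i++)
        work(i);    /* the backward jump to the loop top is taken
                       n times and falls through only once, at exit */
}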
Don Neufeld
This is plain wrong; there is actually an assembly command for letting the branch predictor know. It is ignored on all architectures except Netburst, though.
drhirsch
@don.neufeld: The static rule you mentioned exists only on older architectures; for Core2 and up the direction of a cold jump is pseudo-random, because it depends on an overwritten entry in the history table.
drhirsch
A: 

This sounds to me like overkill - this type of optimization will save tiny amounts of time. For example, using a more modern version of gcc will have a much greater influence on optimizations. Also, try enabling and disabling all the different optimization flags; they don't all improve performance.

Basically, it seems super unlikely this will make any significant difference compared to many other fruitful paths.

EDIT: thanks for the comments. I've made this community wiki, but left it in so others can see the comments.

Peter
Stuck using this version of GCC, with flags -m32 -Wall -O2.
Andy Shulman
No, there can be valid use cases for this. For example, there are compilers that output C as intermediate code and put an "if (break) break_into_debugger()" on each line to provide a platform-independent debugging solution.
Lothar
Actually on deeply pipelined processors branch prediction errors are extremely expensive, since they require a complete pipeline flush. 20x as expensive as an instruction execution is a reasonable estimate. If his benchmarks are telling him that he's got a problem with branch prediction then he's doing the right thing. VTune gives you very good data on this btw, if you haven't tried it.
Don Neufeld
A: 

Sun Studio has some pragmas defined for this case.

#pragma rarely_called ()

This works if one part of a conditional expression is a function call or starts with a function call.

But there is no way to tag a generic if/while statement.
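
A rough sketch of how that might look (hedged: this assumes the Sun Studio convention of naming an already-declared function in the pragma; error_path() is a made-up example):

/* Sun Studio: hint that error_path() is almost never called */
extern void error_path(int code);
#pragma rarely_called(error_path)

void hot_function(int x)
{
    if (x < 0)            /* the arm that calls the rarely_called
                             function is treated as the unlikely one */
        error_path(x);
}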

Lothar
+8  A: 

As Drakosha says, telling gcc which branch is the common case is probably very useful: it lets gcc generate better code for the case where the branch predictor is cold, and it makes the fast path through the function easy for the CPU to execute.

FYI, Pentium 4 had branch-predictor hints as prefixes to the jcc instructions, but only the Netburst microarchitecture ever did anything with them. See http://ref.x86asm.net/geek32.html, and Section 3.5 of Agner Fog's excellent asm opt guide from http://www.agner.org/optimize/. He has a guide to optimizing in C++, too.

Not much is officially published about exactly how the branch predictors and branch-target buffers in the most recent Intel and AMD CPUs behave. The optimization manuals (easy to find on AMD's and Intel's web sites) give some advice, but don't document specific behaviour. Some people have run tests to try to divine the implementation, e.g. how many BTB entries Core2 has... Anyway, the idea of hinting the predictor explicitly has been abandoned (for now). What is documented is, for example, that Core2 has a branch history buffer that can avoid mispredicting the loop exit if the loop always runs a constant short number of iterations, < 8 or 16 IIRC. But don't be too quick to unroll, because a loop that fits in 64 bytes (or 19 uops on Penryn) won't have instruction-fetch bottlenecks because it replays from a buffer... go read Agner Fog's PDFs, they're excellent.

Peter Cordes
BTW, you probably don't need to use builtin_expect if you use profile-guided optimization. PGO records which way each branch went, so when you compile with -fprofile-use, gcc knows which case is the common one for every branch. It still doesn't hurt to use builtin_expect to tell it the fast path, in case your code will be built without PGO, though.
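A sketch of that workflow with a hypothetical hot_path.c (the -fprofile-generate / -fprofile-use flags are standard GCC; common_case() and rare_case() are placeholders):

/* Hypothetical hot_path.c
 *
 *   1. gcc -O2 -fprofile-generate hot_path.c -o hot_path
 *   2. ./hot_path < representative_input     (writes .gcda profile data)
 *   3. gcc -O2 -fprofile-use hot_path.c -o hot_path
 *
 * With -fprofile-use, gcc already knows which way each branch usually
 * goes; the __builtin_expect hint only matters for builds without
 * profile data. */
void common_case(void);
void rare_case(void);

void dispatch(int x)
{
    if (__builtin_expect(x > 0, 1))
        common_case();
    else
        rare_case();
}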
Peter Cordes
+6  A: 

Yes, but it will have no effect. Exceptions are older (obsolete) pre-Netburst architectures, and even then it doesn't do anything measurable.

There is a "branch hint" opcode that Intel introduced with the Netburst architecture, and a default static branch prediction for cold jumps (backward predicted taken, forward predicted not taken) on some older architectures. GCC implements this with __builtin_expect (x, prediction), where prediction is typically 0 or 1. The opcode is ignored on all newer architectures (>= Core 2). The small corner case where this actually does something is a cold jump on the old Netburst architecture. Intel now recommends not using the static branch hints, probably because they consider the increase in code size more harmful than the possible marginal speedup.

There are multiple reasons it doesn't work as expected.

  • The processor can predict small loops (n<64) perfectly.
  • The processor can predict small repeating patterns (n~7) perfectly.
  • The processor itself can estimate the probability of a branch during runtime better than the compiler/programmer during compile time.
  • The predictability (= the probability that a branch will be predicted correctly) of a branch is far more important than the probability that the branch is taken. Unfortunately, this is highly architecture-dependent, and predicting the predictability of a branch is notoriously hard.

Read more about the inner workings of branch prediction in Agner Fog's manuals. See also the gcc mailing list.

drhirsch
Would be nice if you could quote / point to the exact portion where it says that the hint is ignored on newer architectures.
int3
Chapter 3.12 "Static Prediction" in the link I gave.
drhirsch
A: 

I suggest that rather than worrying about branch prediction, you profile the code and optimize it to reduce the number of branches. One example is loop unrolling; another is using boolean (branchless) programming techniques rather than if statements.

Most processors love to prefetch instructions. Generally, a mispredicted branch causes the processor to flush its prefetch queue (and pipeline), and this is where the biggest penalty is. To reduce this penalty, rewrite (and design) the code so that fewer branches are needed. Also, some processors can conditionally execute instructions without having to branch; see the sketch below.
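
As a small illustration of the branchless idea (a sketch only; the clamp operation is just an arbitrary example):

/* Branchy version: the comparison has to be predicted. */
int clamp_min_branchy(int x, int lo)
{
    if (x < lo)
        x = lo;
    return x;
}

/* Branchless version: compilers typically turn the ternary into a
 * conditional move (cmov on x86), so there is nothing to mispredict. */
int clamp_min_branchless(int x, int lo)
{
    return (x < lo) ? lo : x;
}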

I've optimized a program from 1 hour of execution time to 2 minutes by using loop unrolling and large I/O buffers. Branch prediction would not have offered much time savings in this instance.

Thomas Matthews