views:

835

answers:

14

I need a program to get the smaller of two numbers, and I'm wondering if using a standard "if x is less than y"

int a, b, low;
if (a < b) low = a;
else low = b;

is more or less efficient than this:

int a, b, low;
low = b + ((a - b) & ((a - b) >> 31));

(or the variation of putting `int delta = a - b` at the top and replacing instances of `a - b` with that).

I'm just wondering which one of these would be more efficient (or if the difference is too minuscule to be relevant), and the efficiency of if-else statements versus alternatives in general.

+5  A: 

Either way, the assembly will only be a few instructions and either way it'll take picoseconds for those instructions to execute.

I would profile the application and concentrate your optimization efforts on something more worthwhile.

Also, the time saved by this type of optimization will not be worth the time wasted by anyone trying to maintain it.

For simple statements like this, I find the ternary operator very intuitive:

low = (a < b) ? a : b;

Clear and concise.

Ben S
x86 can map a comparison result to 0/1 without a jump.
Mark Ransom
Where is the conditional jump in `low = b + ((a - b) & ((a - b) >> 31))`?
Graphics Noob
I must be missing something, why will there be a conditional jump in his second example?
Torlack
I read it as a logical and for some reason, disregard my conditional comment, editing...
Ben S
I made the same mistake too when I first read the message.
Torlack
Nanoseconds, not picoseconds. Most processors operate at only the GHz clock range.
Crashworks
+1  A: 

Don't worry about your computer, it will execute them both far faster than you should care for. Sometimes the compiler will generate identical code for both alternatives. The question you should really spend your time pondering is, 3 months from now, how fast can you read and understand the two snippets? Or, right now, how quickly can your (human) friend understand what you're doing in either of the two?

wilhelmtell
+20  A: 

(Disclaimer: the following deals with very low-level optimizations that are most often not necessary. If you keep reading, you waive your right to complain that computers are fast and there is never any reason to worry about this sort of thing.)

One advantage of eliminating an if statement is that you avoid branch prediction penalties.

Branch prediction penalties are generally only a problem when the branch is not easily predicted. A branch is easily predicted when it is almost always taken/not taken, or it follows a simple pattern. For example, the branch in a loop statement is taken every time except the last one, so it is easily predicted. However, if you have code like

a = random() % 10
if (a < 5)
  print "Less"
else
  print "Greater"

then this branch is not easily predicted, and will frequently incur the prediction penalty associated with flushing the pipeline and rolling back instructions that were executed down the wrong side of the branch.

One way to avoid these kinds of penalties is to use the ternary (?:) operator. In simple cases, the compiler will generate conditional move instructions rather than branches.

So

int a, b, low;
if (a < b) low = a;
else low = b;

becomes

int a, b, low;
low = (a < b) ? a : b;

and in the second case a branching instruction is not necessary. Additionally, it is much clearer and more readable than your bit-twiddling implementation.

Of course, this is a micro-optimization which is unlikely to have significant impact on your code.

danben
Finally, an answer that isn't bleating about premature optimization. Thank you.
Justicle
@BlueRaja - prove it! Answer the question and show your working.
Justicle
At `-O1` and higher, gcc produces identical code for the if statement and the ternary operator for the min() function, using a cmovg instruction in both cases. At `-O0`, it uses branches and labels for the if statement and cmovle for the ternary operator.
Adam Rosenfield
I agree that this is more readable, but it will certainly not be faster. See my answer.
BlueRaja - Danny Pflughoeft
"and in the second case a branching instruction is not necessary." +1 for bringing the branching overhead into the picture. Otherwise one has to keep in mind that modern CPUs (Intel ones included) have "conditional" operations which remove the need for branching in trivial cases like that.
Dummy00001
+1  A: 

Unless you're really trying to buckle down on efficiency, I don't think this is something you need to worry about.

My simple thought though is that the if would be quicker because it's comparing one thing, while the other code is doing several operations. But again, I imagine that the difference is minuscule.

Jesse J
+11  A: 

Simple answer: One conditional jump is going to be more efficient than two subtractions, one addition, a bitwise and, and a shift operation combined. I've been sufficiently schooled on this point (see the comments) that I'm no longer even confident enough to say that it's usually more efficient.

Pragmatic answer: Either way, you're not paying nearly as much for the extra CPU cycles as you are for the time it takes a programmer to figure out what that second example is doing. Program for readability first, efficiency second.

Bill the Lizard
Your simple answer is wrong for some processors.
nategoose
@nategoose: Which processors?
Bill the Lizard
@Bill: many processors have a long instruction pipeline which must be flushed whenever there's a mispredicted branch, taking perhaps 10 or 20 cycles. In this case, the branch is likely to be mispredicted half the time, so the conditional version might take an average of 5 or 10 cycles, while the squiggly version takes 4 or 5. (Of course, other processors have conditional instructions, short pipelines and other ways to avoid misprediction, and then the conditional version will be faster).
Mike Seymour
And on the processor I mostly use, the first version takes 2 cycles, and the second takes 3.
Mike Seymour
@nategoose, @Mike, @Crashworks: Well, that will teach me to make sweeping generalizations based on benchmarks from one machine. I stand corrected.
Bill the Lizard
@Bill the Lizard: The most extreme example would be most GPUs, I believe. On the other hand if memory serves there are some (but not all) conditional branch instructions on the AVR microcontrollers that take 3 cycles, which would take the same number of cycles as an 8 bit add, bitwise and, and shift by 1.
nategoose
+1  A: 

Why `low = a;` in the if and `low = b;` in the else? And, why 31? If 31 has anything to do with CPU word size, what if the code is to be run on a CPU of a different size?

The if..else way looks more readable. I like programs to be as readable to humans as they are to the compilers.

vpit3833
+2  A: 

One thing I will point out that I haven't noticed mentioned: an optimization like this can easily be overwhelmed by other issues. For example, if you are running this routine on two large arrays of numbers (or worse yet, pairs of numbers scattered in memory), the cost of fetching the values on today's CPUs can easily stall the CPU's execution pipelines.

Torlack
+5  A: 

The biggest problem is that your second example won't work on 64-bit machines.

However, even neglecting that, modern compilers are smart enough to consider branchless alternatives wherever possible and to compare the estimated speeds. So your second example will most likely actually be slower.

There will be no difference between the if statement and using a ternary operator, as even most dumb compilers are smart enough to recognize this special case.

BlueRaja - Danny Pflughoeft
I've looked at the assembly output of MSVC and GCC, and neither of them seem smart enough to emit branchless conditional moves half the time I want them to.
Crashworks
@Crashworks: That means the compiler decided the branchless conditional is actually slower (branchless conditionals require more clocks, but don't have the possibility of clearing the instruction pipeline)
BlueRaja - Danny Pflughoeft
The second example won't even work on all 32 bit machines, since the behaviour of the `>>` operator on negative values is implementation-defined (it's allowable for the compiler to implement it with a logical shift rather than arithmetic shift).
caf
Yes, but the compiler was wrong when it decided that. I've timed both pathways. My job consists of cramming more work into 16.6 milliseconds than the competing product can. In general, I have seen compilers emit many suboptimal code sequences. They are not perfect.
Crashworks
@Crashworks: In that case you should take your most critical portions and rewrite them in highly-optimized assembly
BlueRaja - Danny Pflughoeft
I sometimes do, but it's often easier to meet the compiler halfway and write code in such a way that it results in the code sequence I want; intrinsics in particular are an example of this. That's much easier to intermingle with other C++ code than inline assembly. It's a common practice in the embedded world; part of the job is learning what the compiler will emit for particular inputs.
Crashworks
@Crashworks: the scenario you're describing seems like it's one that's actually worthy of worrying about the optimization. The other 99+% of the cases there's no value in it, so the code should be written to be correct and readable (which may well still involve the conditional operator).
Michael Burr
In practice I wrote an `isel(a,b,c)` function which has the same effect as `return a >= 0 ? b : c`. We just use that. (It was named by analogy with the `fsel` intrinsic, which is the hardware's native floating-point conditional move.) It would be better if the compiler were just smart enough to emit the right code for `?:`, but we haven't got a smart compiler, just GCC.
Crashworks
+5  A: 

Like with any low-level optimization, test it on the target CPU/board setup.

On my compiler (gcc 4.5.1 on x86_64), the first example becomes

cmpl    %ebx, %eax
cmovle  %eax, %esi

The second example becomes

subl    %eax, %ebx
movl    %ebx, %edx
sarl    $31, %edx
andl    %ebx, %edx
leal    (%rdx,%rax), %esi

Not sure if the first one is faster in all cases, but I would bet it is.

Cubbi
+3  A: 

For something as simple as this, why not just experiment and try it out?

Generally, you'd profile first, identify this as a hotspot, experiment with a change, and view the result.

I wrote a simple program that compares both techniques passing in random numbers (so that we don't see perfect branch prediction) with Visual C++ 2010. The difference between the approaches on my machine for 100,000,000 iterations? Less than 50ms total, and the if version tended to be faster. Looking at the codegen, the compiler successfully converted the simple if to a cmovl instruction, avoiding a branch altogether.

Michael
+1  A: 

profile results with gcc -o foo -g -p -O0, Solaris 9 v240

 %Time Seconds Cumsecs  #Calls   msec/call  Name
  36.8    0.21    0.21 8424829      0.0000  foo2
  28.1    0.16    0.37       1    160.      main
  17.5    0.10    0.47 16850667     0.0000  _mcount
  17.5    0.10    0.57 8424829      0.0000  foo1
   0.0    0.00    0.57       4      0.      atexit
   0.0    0.00    0.57       1      0.      _fpsetsticky
   0.0    0.00    0.57       1      0.      _exithandle
   0.0    0.00    0.57       1      0.      _profil
   0.0    0.00    0.57    1000      0.000   rand
   0.0    0.00    0.57       1      0.      exit

code:

int
foo1 (int a, int b, int low)        
{
   if (a < b) 
      low = a;   
   else 
      low = b;         
   return low;
}

int                       
foo2 (int a, int b, int low)
{
   low = (a < b) ? a : b;
   return low;
}


int main()
{
    int low=0;
    int a=0;
    int b=0;
    int i=500;
    while (i--)
    {
       for(a=rand(), b=rand(); a; a--)
       {
           low=foo1(a,b,low);
           low=foo2(a,b,low);
       }        
    }
    return 0;
}

Based on this data, in the above environment, the exact opposite of several beliefs stated here was observed. Note the 'in this environment' caveat: the if construct was faster than the ternary `?:` construct.

jim mcnamara
However, compiling here, `gcc -O2 -S -o output.S input.c`, `foo1` and `foo2` compile to exactly the same 4 instructions. (Linux, gcc 4.3.4, amd64 (core 2 duo))
Thanatos
That was the whole point and why "bleating" about profiling is meaningful. Thanks.
jim mcnamara
A: 

If it is for Gnu C++, try this

int min = i <? j;

I have not profiled it but I think it is definitely the one to beat.

Syd
I don't know what Gnu C++ is, but I don't like its syntax.
Dennis Zickefoose
Gnu C++ is of course the C++ compiler from GCC (the Gnu Compiler Collection). IIRC they've deprecated this form. Just use `std::min(i,j)`. It's unlikely GCC's `std::min<int>` is slower than this.
MSalters
+5  A: 

Compiling this on gcc 4.3.4, amd64 (core 2 duo), Linux:

int foo1(int a, int b)
{
    int low;
    if (a < b) low = a;
    else low = b;
    return low;
}

int foo2(int a, int b)
{
    int low;
    low = b + ((a - b) & ((a - b) >> 31));
    return low;
}

I get:

foo1:
    cmpl    %edi, %esi
    cmovle  %esi, %edi
    movl    %edi, %eax
    ret

foo2:
    subl    %esi, %edi
    movl    %edi, %eax
    sarl    $31,  %eax
    andl    %edi, %eax
    addl    %esi, %eax
    ret

...which I'm pretty sure won't incur branch prediction penalties, since the code doesn't jump. Also, the non-if-statement version is 2 instructions longer. I think I will continue coding, and let the compiler do its job.

Thanatos
A: 

I'm just wondering which one of these would be more efficient (or if the difference is too minuscule to be relevant), and the efficiency of if-else statements versus alternatives in general.

Desktop/server CPUs are optimized for pipelining. The second is theoretically faster because the CPU doesn't have to branch and can utilize multiple ALUs to evaluate parts of the expression in parallel. More non-branching code with intermixed independent operations is best for such CPUs. (But even that is negated now by modern "conditional" CPU instructions, which allow the first code to be made branch-less too.)

On embedded CPUs, branching is often less expensive (relative to everything else), nor do they have many spare ALUs to evaluate operations out of order (that's if they support out-of-order execution at all). Less code/data is better: caches are small too. (I have even seen bubble sort used in embedded applications: the algorithm uses the least memory/code and is fast enough for small amounts of information.)

Important: do not forget about compiler optimizations. Using many tricks, compilers can sometimes remove the branching themselves: inlining, constant propagation, refactoring, etc.

But in the end I would say that yes, the difference is too minuscule to be relevant. In the long term, readable code wins.

The way things are going on the CPU front, it is more rewarding to invest time now in making the code multi-threaded and OpenCL-capable.

Dummy00001