views:

374

answers:

8

I just stumbled upon a change that seems to have counterintuitive performance ramifications. Can anyone provide a possible explanation for this behavior?

Original code:

for (int i = 0; i < ct; ++i) {
    // do some stuff...

    int iFreq = getFreq(i);
    double dFreq = iFreq;

    if (iFreq != 0) {
        // do some stuff with iFreq...
        // do some calculations with dFreq...
    }
}

While cleaning up this code during a "performance pass," I decided to move the definition of dFreq inside the if block, as it was only used inside the if. There are several calculations involving dFreq so I didn't eliminate it entirely as it does save the cost of multiple run-time conversions from int to double. I expected no performance difference, or if any at all, a negligible improvement. However, the perfomance decreased by nearly 10%. I have measured this many times, and this is indeed the only change I've made. The code snippet shown above executes inside a couple other loops. I get very consistent timings across runs and can definitely confirm that the change I'm describing decreases performance by ~10%. I would expect performance to increase because the int to double conversion would only occur when iFreq != 0.

Chnaged code:

for (int i = 0; i < ct; ++i) {
    // do some stuff...

    int iFreq = getFreq(i);

    if (iFreq != 0) {
        // do some stuff with iFreq...
        double dFreq = iFreq;
        // do some stuff with dFreq...
    }
}

Can anyone explain this? I am using VC++ 9.0 with /O2. I just want to understand what I'm not accounting for here.

+6  A: 

Maybe the result of getFreq is kept inside a register in the first case and written to memory in the second case? It might also be, that the performance decrease has to do with CPU mechanisms as pipelining and/or branch prediction. You could check the generated assembly code.

MartinStettner
+2  A: 

maybe the compiler is optimizing it taking the definition outside the for loop. when you put it in the if the compiler optimizations aren't doing that.

John Boker
+1  A: 

There's a likelihood that this changed caused your compiler to disable some optimizations. What happens if you move the declarations above the loop?

mxg
+1  A: 

Once I've read a document about optimization that said that as defining variables just before their usage and not even before was a good practice, the compilers could optimize code following that advice.

This article (a bit old but quite valid) say (with statistics) something similar : http://www.tantalon.com/pete/cppopt/asyougo.htm#PostponeVariableDeclaration

Klaim
+3  A: 

Try moving the definition of dFreq outside of the for loop but keep the assignment inside the for loop/if block.

Perhaps the creation of dFreq on the stack every for loop, inside the if, is causing issue (although the compiler should take care of that). Perhaps a regression in the compiler, if the dFreq var is in the four loop its created once, inside the if inside the for its created every time.

double dFreq;
int iFreq;
for (int i = 0; i < ct; ++i) 
{
    // do some stuff...

    iFreq = getFreq(i);

    if (iFreq != 0) 
    {
        // do some stuff with iFreq...
        dFreq = iFreq;
        // do some stuff with dFreq...
    }
} 
gbrandt
+8  A: 

You should put the conversion to dFreq immediately inside the if() before doing the calculations with iFreq. The conversion may execute in parallel with the integer calculations if the instruction is farther up in the code. A good compiler might be able to push it farther up, and a not-so-good one may just leave it where it falls. Since you moved it to after the integer calculations it may not get to run in parallel with integer code, leading to a slowdown. If it does run parallel, then there may be little to no improvement at all depending on the CPU (issuing an FP instruction whose result is never used will have little effect in the original version).

If you really want to improve performance, a number of people have done benchmarks and rank the following compilers in this order:

1) ICC - Intel compiler 2) GCC - A good second place 3) MSVC - generated code can be quite poor compared to the others.

You may also want to try -O3 if they have it.

phkahler
Those diagrams showing integer and FP execution units make it clear that running a mix of instructions types will get things done faster. You could try interleaving the iFreq and dFreq lines of code too.
phkahler
@STingRaySC: Some can! It is, unfortunately, a hard problem to do well. Even though people focus on MSVC and GCC (and LLVM-related recently), there are *many* C++ compilers. Intel was also mentioned; and http://comeaucomputing.com/ is worth looking into. Also see http://www.infoq.com/presentations/click-crash-course-modern-hardware.
Roger Pate
Comeau compilers your C++ to C; it will not generate assembly for you. As a result, it can't do these kinds of transforms.
MSalters
+1  A: 

It's easy enough to find out. Just take 20 stackshots of the slow version, and of the fast version. In the slow version you will see on roughly 2 of the shots what it is doing that it is not doing in the fast version. You will see a subtle difference in where it halts in the assembly language.

Mike Dunlavey
+4  A: 

This looks to me like a pipeline stall

int iFreq = getFreq(i);
    double dFreq = iFreq;

    if (iFreq != 0) {

Allows the conversion to double to happen in parallel with other code since dFreq is not being used immediately. it gives the compiler something to do between storing iFreq and using it, so this conversion is most likely "free".

But

int iFreq = getFreq(i);

if (iFreq != 0) {
    // do some stuff with iFreq...
    double dFreq = iFreq;
    // do some stuff with dFreq...
}

Could be hitting a store/reference stall after the conversion to double since you begin using the double value right away.

Modern processors can do multiple things per clock cycle, but only when the things are independent. Two consecutive instructions that reference the same register often result in a stall. The actual conversion to double may take 3 clocks, but all but the first clock can be done in parallel with other work, provided you don't refer to the result of the conversion for an instruction or two.

C++ compilers are getting pretty good at re-ordering instructions to take advantage of this, it looks like your change defeated some nice optimization.

One other (less likely) possibility is that when the conversion to float was before the branch, the compiler was able remove the branch entirely. Branchless code is often a major performance win in modern processors.

It would be interesting to see what instructions the compiler actually emitted for these two cases.

John Knoeller
It depends on what the code in the branch does. There are clever ways to do non-branching code for some operations. You didn't show us what the if did, so its hard to know if a branchless solution was possible.
John Knoeller