ansaurus

Question

Answer 1

+6 A:

Maybe the result of getFreq is kept inside a register in the first case and written to memory in the second case? It might also be, that the performance decrease has to do with CPU mechanisms as pipelining and/or branch prediction. You could check the generated assembly code.

MartinStettner 2010-02-05 19:10:23

Answer 2

+2 A:

maybe the compiler is optimizing it taking the definition outside the for loop. when you put it in the if the compiler optimizations aren't doing that.

John Boker 2010-02-05 19:10:33

Answer 3

+1 A:

There's a likelihood that this changed caused your compiler to disable some optimizations. What happens if you move the declarations above the loop?

mxg 2010-02-05 19:12:40

Answer 4

+1 A:

Once I've read a document about optimization that said that as defining variables just before their usage and not even before was a good practice, the compilers could optimize code following that advice.

This article (a bit old but quite valid) say (with statistics) something similar : http://www.tantalon.com/pete/cppopt/asyougo.htm#PostponeVariableDeclaration

Klaim 2010-02-05 19:15:07

Answer 5

+3 A:

Try moving the definition of dFreq outside of the for loop but keep the assignment inside the for loop/if block.

Perhaps the creation of dFreq on the stack every for loop, inside the if, is causing issue (although the compiler should take care of that). Perhaps a regression in the compiler, if the dFreq var is in the four loop its created once, inside the if inside the for its created every time.

double dFreq;
int iFreq;
for (int i = 0; i < ct; ++i) 
{
    // do some stuff...

    iFreq = getFreq(i);

    if (iFreq != 0) 
    {
        // do some stuff with iFreq...
        dFreq = iFreq;
        // do some stuff with dFreq...
    }
}

gbrandt 2010-02-05 19:23:19

Answer 6

+8 A:

You should put the conversion to dFreq immediately inside the if() before doing the calculations with iFreq. The conversion may execute in parallel with the integer calculations if the instruction is farther up in the code. A good compiler might be able to push it farther up, and a not-so-good one may just leave it where it falls. Since you moved it to after the integer calculations it may not get to run in parallel with integer code, leading to a slowdown. If it does run parallel, then there may be little to no improvement at all depending on the CPU (issuing an FP instruction whose result is never used will have little effect in the original version).

If you really want to improve performance, a number of people have done benchmarks and rank the following compilers in this order:

1) ICC - Intel compiler 2) GCC - A good second place 3) MSVC - generated code can be quite poor compared to the others.

You may also want to try -O3 if they have it.

phkahler 2010-02-05 19:32:17

Those diagrams showing integer and FP execution units make it clear that running a mix of instructions types will get things done faster. You could try interleaving the iFreq and dFreq lines of code too.

phkahler 2010-02-05 20:34:26

@STingRaySC: Some can! It is, unfortunately, a hard problem to do well. Even though people focus on MSVC and GCC (and LLVM-related recently), there are *many* C++ compilers. Intel was also mentioned; and http://comeaucomputing.com/ is worth looking into. Also see http://www.infoq.com/presentations/click-crash-course-modern-hardware.

Roger Pate 2010-02-07 03:17:59

Comeau compilers your C++ to C; it will not generate assembly for you. As a result, it can't do these kinds of transforms.

MSalters 2010-02-08 11:51:28

Answer 7

+1 A:

It's easy enough to find out. Just take 20 stackshots of the slow version, and of the fast version. In the slow version you will see on roughly 2 of the shots what it is doing that it is not doing in the fast version. You will see a subtle difference in where it halts in the assembly language.

Mike Dunlavey 2010-02-05 19:46:44

Answer 8

+4 A:

This looks to me like a pipeline stall

int iFreq = getFreq(i);
    double dFreq = iFreq;

    if (iFreq != 0) {

Allows the conversion to double to happen in parallel with other code since dFreq is not being used immediately. it gives the compiler something to do between storing iFreq and using it, so this conversion is most likely "free".

But

int iFreq = getFreq(i);

if (iFreq != 0) {
    // do some stuff with iFreq...
    double dFreq = iFreq;
    // do some stuff with dFreq...
}

Could be hitting a store/reference stall after the conversion to double since you begin using the double value right away.

Modern processors can do multiple things per clock cycle, but only when the things are independent. Two consecutive instructions that reference the same register often result in a stall. The actual conversion to double may take 3 clocks, but all but the first clock can be done in parallel with other work, provided you don't refer to the result of the conversion for an instruction or two.

C++ compilers are getting pretty good at re-ordering instructions to take advantage of this, it looks like your change defeated some nice optimization.

One other (less likely) possibility is that when the conversion to float was before the branch, the compiler was able remove the branch entirely. Branchless code is often a major performance win in modern processors.

It would be interesting to see what instructions the compiler actually emitted for these two cases.

John Knoeller 2010-02-05 21:25:16

It depends on what the code in the branch does. There are clever ways to do non-branching code for some operations. You didn't show us what the if did, so its hard to know if a branchless solution was possible.

John Knoeller 2010-02-05 21:38:54

ansaurus

tags:

views:

answers:

Strange C++ performance difference?

related questions