bug in my gcc? bug in my code? both?

http://files.minthos.com/code/speedtest_doubles_wtf.cpp

Somehow gcc manages to "optimize" a function whose only net effect is zeroing out the array of doubles into taking 2.6 seconds on my Q6600, while the more complex function, which fills the array with something somewhat meaningful, takes only 33 ms.

I'd be interested to know whether others get similar results and, if so, whether anyone can explain what's going on. I'd also like to figure out what causes the huge difference between integer and floating-point performance (especially when compiling without optimization).

+1  A: 

You're not resetting `begin` between benchmarks, so your timing numbers are difficult to interpret. Maybe this is the source of your confusion?
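
For illustration, a minimal sketch of taking a fresh start time per benchmark (hypothetical code, assuming the program measures wall-clock time with something like gettimeofday; none of these names are from the linked file):

#include <sys/time.h>
#include <cstdio>

// Current wall-clock time in milliseconds.
static double now_ms()
{
    timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

static void dummy_work()
{
    volatile double x = 0.0;
    for(int i = 0; i < 1000000; i++)
        x = x + 0.1;
}

int main()
{
    // Re-read the clock before each benchmark instead of reusing one global
    // begin, so each printed number stands on its own.
    double begin = now_ms();
    dummy_work();
    printf("first: %f ms\n", now_ms() - begin);

    begin = now_ms();                             // reset before the next measurement
    dummy_work();
    printf("second: %f ms\n", now_ms() - begin);
    return 0;
}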

dhaffey
Yeah, good suggestion. I don't think it's the problem, but at least it makes the output easier to interpret. Code updated to reflect this.
Minthos
+4  A: 

Line 99:

memcpy(floats, ints, sizeof(floats));

is partially initializing floats[] with what is effectively floating-point garbage; the rest remain zero. This stems from replacing the floats with integer bit patterns and then interpreting them as doubles. Perhaps the overflows and underflows are affecting performance? To test, I changed the random number seed to a constant 1000 for reproducibility and got these results:

[wally@zenetfedora Downloads]$ ./speedtest_doubles_wtf.cpp
no optimization
begin: 0.017000
floats: 27757.816000
ints: 28117.604000
floats: 40346.196000
ints: 41094.988000
sum: 7999999.998712
sum2: 67031739228347449344.000000
mild optimization
begin: 0.014000
floats: 68.574000
ints: 68.609000
floats: 147.105000
ints: 820.609000
sum: 8000000.000001
sum2: 67031739228347441152.000000
heavier optimization
begin: 0.014000
floats: 73.588000
ints: 73.623000
floats: 144.105000
ints: 1809.980000
sum: 8000000.000001
sum2: 67031739228347441152.000000
again, now using ffun2()
no optimization
begin: 0.017000
floats: 22720.648000
ints: 23076.134000
floats: 35480.824000
ints: 36229.484000
floats: 46324.080000
sum: 0.000000
sum2: 67031739228347449344.000000
mild optimization
begin: 0.013000
floats: 69.937000
ints: 69.967000
floats: 138.010000
ints: 965.654000
floats: 19096.902000
sum: 0.000000
sum2: 67031739228347441152.000000
heavier optimization
begin: 0.015000
floats: 95.851000
ints: 95.896000
floats: 206.594000
ints: 1699.698000
floats: 29382.348000
sum: 0.000000
sum2: 67031739228347441152.000000

Repeating the test after replacing the memcpy with a proper assignment, so that type conversion can occur, should avoid the floating-point boundary conditions:

for(int i = 0; i < 16; i++)
{
    ints[i] = rand();     // random 32-bit value stored in the 64-bit integer
    floats[i] = ints[i];  // arithmetic conversion to double, not a bit-for-bit copy
}

The modified program, still with constant 1000 as random seed, provides these results:

[wally@zenetfedora Downloads]$ ./speedtest_doubles_wtf.cpp
no optimization
begin: 0.013000
floats: 35814.832000
ints: 36172.180000
floats: 85950.352000
ints: 86691.680000
sum: inf
sum2: 67031739228347449344.000000
mild optimization
begin: 0.013000
floats: 33136.644000
ints: 33136.678000
floats: 51600.436000
ints: 52494.104000
sum: inf
sum2: 67031739228347441152.000000
heavier optimization
begin: 0.013000
floats: 31914.496000
ints: 31914.540000
floats: 48611.204000
ints: 49971.460000
sum: inf
sum2: 67031739228347441152.000000
again, now using ffun2()
no optimization
begin: 0.014000
floats: 40202.956000
ints: 40545.120000
floats: 104679.168000
ints: 106142.824000
floats: 144527.936000
sum: inf
sum2: 67031739228347449344.000000
mild optimization
begin: 0.014000
floats: 33365.716000
ints: 33365.752000
floats: 49180.112000
ints: 50145.824000
floats: 80342.648000
sum: inf
sum2: 67031739228347441152.000000
heavier optimization
begin: 0.014000
floats: 31515.560000
ints: 31515.604000
floats: 47947.088000
ints: 49016.240000
floats: 78929.784000
sum: inf
sum2: 67031739228347441152.000000

This is an older PC, circa 2004, otherwise lightly loaded.

Looks like that made matters slower. Fewer zeros to do arithmetic with, perhaps? That is what many random bit patterns look like when read as doubles, or they come out as tiny values like 0.0000000000000000000000000382652. Once such a value is added to, say, 0.1, its low bits tend to be removed.
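
A quick toy check (a hypothetical snippet, not part of the test program) of that absorption effect:

#include <cstdio>

int main()
{
    // A tiny value far below the precision of double in the neighbourhood of 0.1.
    double tiny = 0.0000000000000000000000000382652;
    double sum = 0.1 + tiny;

    // The tiny value's bits fall below the least significant bit of 0.1,
    // so they are rounded away entirely.
    printf("0.1 + tiny == 0.1 ? %s\n", sum == 0.1 ? "yes" : "no");
    return 0;
}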

wallyk
Wow, I had no idea random bits could do that to the performance! I'm baffled and amazed :)
Minthos
Since the integers are 64-bit (but containing 32-bit values), the `memcpy` is actually initialising all the doubles, with denormalised values.
Mike Seymour
Denormalised math is known to be slower on at least some implementations.
MSalters
A: 

The random 64-bit integers all have zeros in the upper 32 bits, since rand() returns 32-bit values (at least for gcc on a 32-bit platform). So all the doubles will be denormalised, since the reinterpreted bit patterns of the integers will have zero for the exponent field. Adding 0.1 to a denormalised value gives a normalised value (very close to 0.1).
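
A small sketch of that reinterpretation (hypothetical code, not from the original program):

#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cmath>

int main()
{
    srand(1000);

    // Storing a rand() result in a 64-bit integer leaves the upper 32 bits zero,
    // so after a bit-for-bit copy the double's sign and exponent fields are zero.
    int64_t bits = rand();
    double d;
    std::memcpy(&d, &bits, sizeof d);          // reinterpret the bits, no conversion

    printf("value: %g, subnormal: %s\n",
           d, std::fpclassify(d) == FP_SUBNORMAL ? "yes" : "no");

    // Adding 0.1 swamps the denormal and gives a normal value very close to 0.1.
    double e = d + 0.1;
    printf("value + 0.1: %.17g, normal: %s\n",
           e, std::fpclassify(e) == FP_NORMAL ? "yes" : "no");
    return 0;
}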

So each line of ffun2 is a multiplication by a denormalised value; each line of ffun3 is a multiplication by a normalised value. Looking at the generated assembly, I see that the multipliers are calculated before the loop; in each case the loop consists of nothing but multiplications. The most likely explanation for the difference in execution time is that multiplication takes much longer if the multiplier is denormalised.
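
For example, a rough micro-benchmark along these lines (hypothetical code, not the poster's test) should show the gap on hardware where denormal operands trigger a slow microcode assist:

#include <cstdio>
#include <ctime>

// Time repeated multiplications of a small array by the given multiplier.
static double time_multiplies(const double *a, int n, double m)
{
    volatile double multiplier = m;               // volatile: force the multiply every iteration
    volatile double sink = 0.0;                   // volatile: keep the loop from being removed
    clock_t start = clock();
    for(int pass = 0; pass < 200000; pass++)
        for(int i = 0; i < n; i++)
            sink = a[i] * multiplier;             // the only work in the loop
    return 1000.0 * (clock() - start) / CLOCKS_PER_SEC;
}

int main()
{
    double a[16];
    for(int i = 0; i < 16; i++)
        a[i] = 1.0 + i * 0.001;                   // ordinary, normalised values

    double normal   = 0.1;                        // normalised multiplier
    double denormal = 1e-310;                     // subnormal multiplier, below DBL_MIN

    // Note: compiling with -ffast-math may enable flush-to-zero and hide the difference.
    printf("normal multiplier:   %.1f ms\n", time_multiplies(a, 16, normal));
    printf("denormal multiplier: %.1f ms\n", time_multiplies(a, 16, denormal));
    return 0;
}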

As for the last question: floating point arithmetic (particularly double precision) is much more complex than integer arithmetic, so on a reasonably modern pipelined processor each instruction will take longer to execute.

Mike Seymour