Views: 171 · Answers: 5
I recently read this post: http://stackoverflow.com/questions/2550281/floating-point-vs-integer-calculations-on-modern-hardware and was curious about my own processor's performance on this quasi-benchmark, so I put together two versions of the code, one in C# and one in C++ (Visual Studio 2010 Express), and compiled both with optimizations to see what would fall out. The output from my C# version is fairly reasonable:

int add/sub: 350ms
int div/mul: 3469ms
float add/sub: 1007ms
float div/mul: 67493ms
double add/sub: 1914ms
double div/mul: 2766ms

When I compiled and ran the C++ version something completely different shook out:

int add/sub: 210.653ms
int div/mul: 2946.58ms
float add/sub: 3022.58ms
float div/mul: 172931ms
double add/sub: 1007.63ms
double div/mul: 74171.9ms

I expected some performance differences, but not this large! I don't understand why division/multiplication in C++ is so much slower than addition/subtraction, whereas the managed C# version is much closer to my expectations. The code for the C++ version of the function is as follows:

template <typename T> void GenericTest(const char *typestring)
{
    T v = 0;
    T v0 = (T)((rand() % 256) / 16) + 1;
    T v1 = (T)((rand() % 256) / 16) + 1;
    T v2 = (T)((rand() % 256) / 16) + 1;
    T v3 = (T)((rand() % 256) / 16) + 1;
    T v4 = (T)((rand() % 256) / 16) + 1;
    T v5 = (T)((rand() % 256) / 16) + 1;
    T v6 = (T)((rand() % 256) / 16) + 1;
    T v7 = (T)((rand() % 256) / 16) + 1;
    T v8 = (T)((rand() % 256) / 16) + 1;
    T v9 = (T)((rand() % 256) / 16) + 1;

    HTimer tmr = HTimer();
    tmr.Start();
    for (int i = 0 ; i < 100000000 ; ++i)
    {
        v += v0;
        v -= v1;
        v += v2;
        v -= v3;
        v += v4;
        v -= v5;
        v += v6;
        v -= v7;
        v += v8;
        v -= v9;
    }
    tmr.Stop();

      // I removed the bracketed values from the table above; they just make the compiler
      // assume I am using the value for something, so it doesn't optimize it out.
    cout << typestring << " add/sub: " << tmr.Elapsed() * 1000 << "ms [" << (int)v << "]" << endl;

    tmr.Start();
    for (int i = 0 ; i < 100000000 ; ++i)
    {
        v /= v0;
        v *= v1;
        v /= v2;
        v *= v3;
        v /= v4;
        v *= v5;
        v /= v6;
        v *= v7;
        v /= v8;
        v *= v9;
    }
    tmr.Stop();

    cout << typestring << " div/mul: " << tmr.Elapsed() * 1000 << "ms [" << (int)v << "]" << endl;
}

The C# tests are not generic; the double version is implemented like this:

static double DoubleTest()
{
    Random rnd = new Random();
    Stopwatch sw = new Stopwatch();

    double v = 0;
    double v0 = (double)rnd.Next(1, int.MaxValue);
    double v1 = (double)rnd.Next(1, int.MaxValue);
    double v2 = (double)rnd.Next(1, int.MaxValue);
    double v3 = (double)rnd.Next(1, int.MaxValue);
    double v4 = (double)rnd.Next(1, int.MaxValue);
    double v5 = (double)rnd.Next(1, int.MaxValue);
    double v6 = (double)rnd.Next(1, int.MaxValue);
    double v7 = (double)rnd.Next(1, int.MaxValue);
    double v8 = (double)rnd.Next(1, int.MaxValue);
    double v9 = (double)rnd.Next(1, int.MaxValue);

    sw.Start();
    for (int i = 0; i < 100000000; i++)
    {
        v += v0;
        v -= v1;
        v += v2;
        v -= v3;
        v += v4;
        v -= v5;
        v += v6;
        v -= v7;
        v += v8;
        v -= v9;
    }
    sw.Stop();

    Console.WriteLine("double add/sub: {0}", sw.ElapsedMilliseconds);
    sw.Reset();

    sw.Start();
    for (int i = 0; i < 100000000; i++)
    {
        v /= v0;
        v *= v1;
        v /= v2;
        v *= v3;
        v /= v4;
        v *= v5;
        v /= v6;
        v *= v7;
        v /= v8;
        v *= v9;
    }
    sw.Stop();

    Console.WriteLine("double div/mul: {0}", sw.ElapsedMilliseconds);
    sw.Reset();

    return v;
}

Any ideas here?

A: 

If you're interested in floating-point speed and possible optimizations, read Agner Fog's optimization manual: http://www.agner.org/optimize/optimizing_cpp.pdf

Also check this MSDN article: http://msdn.microsoft.com/en-us/library/aa289157%28VS.71%29.aspx

Your results can depend on things such as the JIT and the compilation flags: debug vs. release, what kind of FP optimizations are permitted, and the allowed instruction set.

Try setting these flags to maximum optimization, and change your program so that it definitely won't produce overflows or NaNs, because those affect computation speed. (Even something like "v += v1; v += v2; v -= v1; v -= v2;" is fine, because it won't be reduced under "strict" or "precise" floating-point modes.) Also try not to use more variables than you have FP registers.
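A minimal sketch of the pattern suggested above (my own illustration, not code from the answer): the adds and subtracts cancel exactly for small integer-valued operands, so the accumulator never drifts into overflow, NaN, or denormal territory, yet under "strict"/"precise" FP modes the compiler must still perform every operation.

```cpp
#include <cassert>

// Balanced benchmark loop: with small integer-valued doubles like 3.0 and
// 5.0, each add/subtract pair cancels exactly, so v stays in the normal
// range regardless of the iteration count. Under strict/precise FP modes
// the compiler may not reorder or cancel these operations itself.
double balanced_loop(double v, double v1, double v2, int iters)
{
    for (int i = 0; i < iters; ++i)
    {
        v += v1;
        v += v2;
        v -= v1;
        v -= v2;
    }
    return v;
}
```

Note the exact cancellation only holds for operands whose sums are exactly representable; with large or irrational-looking values, rounding would make the pairs inexact.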

ruslik
+1  A: 

Multiplication isn't bad. It is a few cycles slower than addition, but yes, division is very slow compared to the others. It takes significantly longer and, unlike the other three operations, it is not pipelined.

jalf
+3  A: 

It's possible that C# optimized the division by each vx into multiplication by 1 / vx, since it knows those values aren't modified during the loop and it can compute the inverses just once up front.

You can do this optimization yourself and time it in C++.

Mark B
+2  A: 

For the float div/mul tests, you're probably getting denormalized values, which are much slower to process than normal floating-point values. This isn't an issue for the int tests, and it would crop up much later for the double tests.

You should be able to add this to the start of the C++ version to flush denormals to zero (it's declared in <float.h>):

_controlfp(_DN_FLUSH, _MCW_DN);

I'm not sure how to do it in C# though (or if it's even possible).

Some more info here: http://stackoverflow.com/questions/2051534/floating-point-math-execution-time
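To see how the benchmark could land in the denormal range in the first place, here is a small portable sketch (my own, not part of the answer): repeatedly halving a float eventually produces a nonzero value below the smallest normal float, and on many CPUs arithmetic on such subnormal values falls back to a slow microcode path.

```cpp
#include <limits>

// Returns true if halving v the given number of times produces a subnormal
// (denormalized) result: nonzero, but smaller than the smallest normal float.
// std::numeric_limits<float>::min() is the smallest *normal* float (2^-126);
// subnormals extend below it down to about 2^-149.
bool becomes_subnormal(float v, int halvings)
{
    const float min_normal = std::numeric_limits<float>::min();
    for (int i = 0; i < halvings; ++i)
        v *= 0.5f;
    return v != 0.0f && v < min_normal;
}
```

In the original benchmark, a long chain of divisions by values greater than 1 can drive the accumulator into exactly this range, which is what flushing denormals to zero avoids.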

celion
This solved it. Adding that line to the GenericTest function made it execute much more reasonably: 1 sec for float add/sub, 1.3 sec for float mul/div, 1 sec for double add/sub, 1.6 sec for double mul/div.
Chris D.
I'm still not sure that it's not a flawed benchmark, but at least now it's *less* flawed :)
celion
A: 

I also decided that your C++ results looked incredibly slow, so I ran the code myself. Turns out that, actually, you're totally wrong.

I replaced your timer (I have no idea which timer you were using, and I don't have it handy) with the Windows high-performance timer. That thing can do nanoseconds or better. Guess what? Visual Studio says no. I didn't even tweak it for maximum performance. VS can see right through this sort of thing and elided all of the loops. That's why you should never, ever, ever use this sort of "profiling". Get a professional profiler and come back. Unless 2010 Express is different from 2010 Professional, which I doubt; they mainly differ in IDE features, not in raw code performance/optimization.
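For reference, the usual defence against the elision being described here (a sketch of my own, not DeadMG's code): sink the loop's result into a volatile object. The compiler must perform the volatile store, so it cannot delete the loop that produces the value.

```cpp
// A volatile sink: the store to g_sink cannot be optimized away, so the
// loop feeding it must actually run even under full optimization.
volatile double g_sink;

double timed_work(double v, double v0, int iters)
{
    for (int i = 0; i < iters; ++i)
        v += v0;
    g_sink = v; // forces the result to be materialized
    return v;
}
```

The OP's benchmark prints `(int)v`, which is a similar idea, but printing after *both* timed loops have modified `v` still leaves the compiler room to restructure the work in between.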

I'm not even going to bother running your C#.

Edit: This is DEBUG x64 (the previous screenshot is x86, but I thought I'd do x64 since I'm on x64), and I also fixed a minor bug that caused the time to come out negative rather than positive. So unless you want to tell me that your release FP on 32-bit is a hundred times slower, I think you've screwed up.

One thing I did find curious is that the x86 debug build never terminated on the second float test: if you ran float first, then double, it was the double div/mul that hung; if you ran double then float, the float div/mul hung. Must be a compiler glitch.

DeadMG
uhm, 0 nanoseconds? Are you using some NASA PC?
PoweRoy
@PoweRoy: Read the post. The point is that the compiler optimized all of it away.
DeadMG
Seriously, you posted a screenshot instead of actual code?
SoapBox
@SoapBox: The OP already posted my code. Frankly, I just couldn't be bothered to type the results out.
DeadMG
@DeadMG How come the first and second timings are different? The first one only shows 0 (and you don't explain why).
PoweRoy
@PoweRoy: ... Perhaps you could read the post. The second run was in DEBUG mode, i.e., all compiler optimizations disabled and additional debugging overhead, and it's still a hundred times faster than the OP's times.
DeadMG
I am using the high-performance timer (QueryPerformanceCounter) in both cases. I have never had the compiler optimize the loops away; what compiler options did you set?
Chris D.
@Chris D: All I did was change the configuration to Release. That was it. I didn't go through every possible optimization option or even set the overall optimization level.
DeadMG