Hi everyone,

I did some speed testing to figure out which is faster: multiplication or division on numbers. I had to really work hard to defeat the optimiser. I got nonsensical results, such as a massive loop running in 2 microseconds, or multiplication being the same speed as division (if only that were true).

After I finally worked hard enough to defeat enough of the compiler optimisations, while still letting it optimise for speed, I got these speed results. They may be of interest to someone else.

If my test is STILL FLAWED, let me know, but be kind seeing as I just spent two hours writing this crap :P

64 time: 3826718 us
32 time: 2476484 us
D(mul) time: 936524 us
D(div) time: 3614857 us
S time: 1506020 us

"Multiplying to divide" using doubles seems the fastest way to do a division, followed by integer division. I did not test the accuracy of division. Could it be that "proper division" is more accurate? I have no desire to find out after these speed test results as I'll just be using integer division on a base 10 constant and letting my compiler optimise it for me ;) (and not defeating it's optimisations either).

Here's the code I used to get the results:

#include <cstdint>  // uint32_t, uint64_t
#include <cstdio>   // printf
#include <ctime>    // clock

int Run(int bla, int div, int add, int minus) {
 // these parameters are to force the compiler to not be able to optimise away the
 // multiplications and divides :)
 long LoopMax = 100000000;

 uint32_t Origbla32 = 1000000000;
 long i = 0;

 uint32_t bla32 = Origbla32;
 uint32_t div32 = div;
 clock_t Time32 = clock();
 for (i = 0; i < LoopMax; i++) {
  div32 += add;
  div32 -= minus;
  bla32 = bla32 / div32;
  bla32 += bla;
  bla32 = bla32 * div32;
 }
 Time32 = clock() - Time32;

 uint64_t bla64 = bla32;
 clock_t Time64 = clock();
 uint64_t div64 = div;
 for (long i = 0; i < LoopMax; i++) {
  div64 += add;
  div64 -= minus;
  bla64 = bla64 / div64;
  bla64 += bla;
  bla64 = bla64 * div64;
 }
 Time64 = clock() - Time64;

 double blaDMul = Origbla32;
 double multodiv = 1.0 / (double)div;
 double multomul = div;
 clock_t TimeDMul = clock();
 for (i = 0; i < LoopMax; i++) {
  multodiv += add;
  multomul -= minus;
  blaDMul = blaDMul * multodiv;
  blaDMul += bla;
  blaDMul = blaDMul * multomul;
 }
 TimeDMul = clock() - TimeDMul;

 double blaDDiv = Origbla32;
 clock_t TimeDDiv = clock();
 for (i = 0; i < LoopMax; i++) {
  multodiv += add;
  multomul -= minus;
  blaDDiv = blaDDiv / multomul;
  blaDDiv += bla;
  blaDDiv = blaDDiv / multodiv;
 }
 TimeDDiv = clock() - TimeDDiv;

 float blaS = Origbla32;
 float divS = div;
 clock_t TimeS = clock();
 for (i = 0; i < LoopMax; i++) {
  divS += add;
  divS -= minus;
  blaS = blaS / divS;
  blaS += bla;
  blaS = blaS * divS;
 }
 TimeS = clock() - TimeS;

 printf("64 time: %i us  (%i)\n", (int)Time64, (int)bla64);
 printf("32 time: %i us  (%i)\n", (int)Time32, bla32);

 printf("D(mul) time: %i us  (%f)\n", (int)TimeDMul, blaDMul);
 printf("D(div) time: %i us  (%f)\n", (int)TimeDDiv, blaDDiv);
 printf("S time: %i us  (%f)\n", (int)TimeS, blaS);

 return 0;
}

int main(int argc, char* const argv[]) {
 Run(0, 10, 0, 0); // adds and minuses 0 so it doesn't affect the math, only kills the opts
 return 0;
}
+7  A: 

There are lots of ways to perform certain arithmetic, so there might not be a single answer (shifting, fractional multiplication, actual division, some round-trip through a logarithm unit, etc; these might all have different relative costs depending on the operands and resource allocation).
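For instance, an unsigned divide by a power of two is just a shift, while the signed case is not, and that's exactly the kind of decision the compiler is best placed to make (an illustrative sketch, not code from the question):

unsigned HalfU(unsigned x) { return x >> 1; } // same result as x / 2
int      HalfS(int x)      { return x / 2; }  // NOT the same as x >> 1 when x is negative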

Let the compiler do its thing with the program and data flow information it has.

For some data applicable to assembly on x86, you might look at: "Instruction latencies and throughput for AMD and Intel x86 processors"

Joe Koberg
Well played, sir.
Marcin
"Defeating" the optimizer is about preventing dead-code elimination, because we're interested in a side-effect the optimizer is designed to avoid: time. The same thing can happen in the much simpler case of trying to see how fast this loop runs: `for (int i = 1; i <= 1000000; ++i) {}`. An optimizer is allowed to eliminate that loop entirely, as far as the language is concerned.
Roger Pate
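A minimal sketch of one way around that: give the loop a side effect the optimiser must preserve. The volatile sink below is illustrative, not code from this thread.

#include <cstdio>
#include <ctime>

volatile unsigned sink; // a write the optimiser is not allowed to discard

int main() {
 unsigned acc = 1;
 clock_t t0 = clock();
 for (int i = 1; i <= 1000000; ++i)
  acc = acc * 31u + i; // each iteration depends on the previous one
 sink = acc;           // observable side effect keeps the loop alive
 printf("%ld ticks\n", (long)(clock() - t0));
 return 0;
}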
R. Pate... indeed.
boytheo
True points. The benchmark should be in assembly. But the compiler is still doing a lot more - like deciding to do a shift to divide by 2. What if the runtime code says - "check LSB, if 0 shift, if not perform normal integer multiply." Who knows, it might be faster given the functional unit breakdown at that point in the code. Really you're measuring the speed of compiler+processor anytime you benchmark compiled code.
Joe Koberg
Also the micro-benchmark of how fast the core can execute a stream of nothing but DIVs against registers runs into some problems. Perhaps better to actually include code that produces an end-effect, because "in real life" you will have to store/communicate the results of that computation anyway.
Joe Koberg
+4  A: 

What is fastest will depend entirely on the target architecture. It looks here like you're interested only in the platform you happen to be on, which guessing from your execution times seems to be 64-bit x86, either Intel (Core2?) or AMD.

That said, floating-point multiplication by the inverse will be the fastest on many platforms, but is, as you speculate, usually less accurate than a floating-point divide (two roundings instead of one -- whether or not that matters for your usage is a separate question). In general, you are better off re-arranging your algorithm to use fewer divides than you are jumping through hoops to make division as efficient as possible (the fastest division is the one you don't do), and make sure to benchmark before you spend time optimizing at all, as algorithms that bottleneck on division are few and far between.
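To make "fewer divides" concrete, here is a hedged sketch (the names are made up): hoisting a single reciprocal out of a loop trades n divides for one divide plus n multiplies, at the cost of the second rounding mentioned above.

void ScaleAll(double* v, int n, double d) {
 const double inv = 1.0 / d; // one divide, one rounding
 for (int i = 0; i < n; ++i)
  v[i] *= inv; // the second rounding happens here, once per element
}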

Also, if you have integer sources and need an integer result, make sure to include the cost of conversion between integer and floating-point in your benchmarking.

Since you're interested in timings on a specific machine, you should be aware that Intel now publishes this information in their Optimization Reference Manual (pdf). Specifically, you will be interested in the tables of Appendix C section 3.1, "Latency and Throughput with Register Operands".

Be aware that integer divide timings depend strongly on the actual values involved. Based on the information in that guide, it seems that your timing routines still have a fair bit of overhead, as the performance ratios you measure don't match up with Intel's published information.

Stephen Canon
The question about "why" I did this... it was partly learning, or confirming what I suspected. I think the main lesson I "re-learnt" is that there isn't much use in optimising certain things :) I certainly won't be "micro-optimising" any code on doubles soon; I'll be coding in the most simple/natural way for the task at hand. But having a feel for the speed differences may help in the future. It's always good to know how things work.
boytheo
Oh, absolutely. I'm certainly the last person who would want to come off as an anti-microoptimization "just let the compiler do its job" zealot. I'll write up some notes on how you can get more accurate measurements for this particular benchmark, if you're interested.
Stephen Canon
Didn't realize I had come off quite so zealously! Anyway I am all in favor of a detailed understanding of how the hardware works. It might be interesting to examine the assembly output of the compiler as well...
Joe Koberg
@Joe -- the comment wasn't intended to refer to you =). There's just a certain population on SO who seem to believe that this type of thing is *never* worth looking at, and that the compiler will *always* do better than you. Suffice to say, they're wrong.
Stephen Canon
+1  A: 

As Stephen mentioned, use the optimisation manual - but you should also consider using SSE instructions. These can do two or four divisions/multiplications (packed doubles or floats) in a single instruction; see the sketch below.
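A minimal illustration, assuming SSE support and n divisible by 4 (the function name is made up):

#include <xmmintrin.h>

void DivideAllBy(float* v, int n, float d) {
 const __m128 vd = _mm_set1_ps(d); // broadcast the divisor into all four lanes
 for (int i = 0; i < n; i += 4) {
  __m128 x = _mm_loadu_ps(v + i);          // load four floats
  _mm_storeu_ps(v + i, _mm_div_ps(x, vd)); // four divisions in one instruction
 }
}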

Also, it is fairly common for a division to take a single clock cycle to process. The result may not be available for several clock cycles (called latency), however the next division can begin during this time (overlapping with the first) as long as it does not require the result from the first. This is due to pipe-lining in the CPU, in the same way as you can wash more clothes while the previous load is still drying.
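To illustrate the latency/throughput distinction, a sketch with made-up names (how much overlap you actually get depends on how pipelined the divider is):

// One long dependent chain: every divide must wait for the previous result.
double ChainDiv(double x, double d, int n) {
 for (int i = 0; i < n; ++i)
  x /= d;
 return x;
}

// Four independent chains: the hardware is free to overlap these divides.
double SplitDiv(double a, double b, double c, double e, double d, int n) {
 for (int i = 0; i < n; i += 4) {
  a /= d; b /= d; c /= d; e /= d;
 }
 return a + b + c + e;
}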

Multiplying to divide is a common trick, and should be used wherever your divisor changes infrequently.

There is a very good chance that you will spend time and effort making the maths fast, only to discover that it is the speed of memory access (as you navigate the input and write the output) that limits your final implementation.

Tom Leys
Be aware that the throughput of division (FP or integer) on Core/Core2/i7 isn't anywhere close to single cycle; multiplication by a precomputed inverse does have single-cycle throughput, however.
Stephen Canon
A: 

I wrote a flawed test to do this on MSVC 2008

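// NB: GetTime() and DebugOutput() are the poster's own timing/logging helpers (not shown).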
double i32Time  = GetTime();
{
 volatile __int32 i = 4;
 __int32 count = 0;
 __int32 max  = 1000000;
 while( count < max )
 {
  i /= 61;
  count++;
 }
}
i32Time = GetTime() - i32Time;

double i64Time = GetTime();
{
 volatile __int64 i = 4;
 __int32 count = 0;
 __int32 max  = 1000000;
 while( count < max )
 {
  i /= 61;
  count++;
 }
}
i64Time = GetTime() - i64Time;


double fTime = GetTime();
{
 volatile float i = 4;
 __int32 count = 0;
 __int32 max  = 1000000;
 while( count < max )
 {
  i /= 4.0f;
  count++;
 }
}
fTime = GetTime() - fTime;

double fmTime = GetTime();
{
 volatile float i = 4;
 const float div = 1.0f / 4.0f;
 __int32 count = 0;
 __int32 max  = 1000000;
 while( count < max )
 {
  i *= div;
  count++;
 }
}
fmTime = GetTime() - fmTime;

double dTime = GetTime();
{
 volatile double i = 4;
 __int32 count = 0;
 __int32 max  = 1000000;
 while( count < max )
 {
  i /= 4.0f;
  count++;
 }
}
dTime = GetTime() - dTime;

double dmTime = GetTime();
{
 volatile double i = 4;
 const double div = 1.0f / 4.0f;
 __int32 count = 0;
 __int32 max  = 1000000;
 while( count < max )
 {
  i *= div;
  count++;
 }
}
dmTime = GetTime() - dmTime;


DebugOutput( _T( "%f\n" ), i32Time );
DebugOutput( _T( "%f\n" ), i64Time );
DebugOutput( _T( "%f\n" ), fTime );
DebugOutput( _T( "%f\n" ), fmTime );
DebugOutput( _T( "%f\n" ), dTime );
DebugOutput( _T( "%f\n" ), dmTime );

DebugBreak();

I then ran it on an AMD Turion 64 in 32-bit mode. The results I got were as follows:

0.006622
0.054654
0.006283
0.006353
0.006203
0.006161

The reason the test is flawed is the use of volatile, which forces the compiler to re-load the variable from memory just in case it has changed. All in all, it shows there is precious little difference between any of the implementations on this machine (__int64 is obviously slow).

It also categorically shows that the MSVC compiler performs the multiply-by-reciprocal optimisation. I imagine GCC does the same, if not better. If I change the float and double division checks to divide by "i", the time increases significantly. Though a lot of that could be the re-loading from memory, it is obvious the compiler can't optimise that away so easily.

To understand such micro-optimisations try reading this pdf.

All in all, I'd argue that if you are worrying about such things you obviously haven't profiled your code. Profile, and fix the problems as and when they actually ARE a problem.

Goz
A: 

Agner Fog has done some pretty detailed measurements himself, which can be found here. If you're really trying to optimize stuff, you should read the rest of the documents from his software optimization resources as well.

I would point out that, even if you are measuring non-vectorized floating-point operations, the compiler has two options for the generated assembly: it can use the FPU instructions (fadd, fmul) or it can use SSE instructions while still manipulating one floating-point value per instruction (addss, mulss). In my experience the SSE instructions are faster and have fewer inaccuracies, but compilers don't make it the default because it could break compatibility with code that relies on the old behaviour. You can turn it on in gcc with the -mfpmath=sse flag.
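For example, something like this (the file name is a placeholder; -msse2 makes the instructions available on 32-bit targets):

g++ -O2 -msse2 -mfpmath=sse bench.cpp -o bench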

int3
SSE is targeted by default by the OS X compilers. As he's on a Snow Leopard machine, there's no need for the `-mfpmath` flag.
Stephen Canon
Never knew that, thanks for the info
int3