Hi everyone,

I did some speed testing to figure out which is faster: multiplication or division on numbers. I had to really work hard to defeat the optimiser. I got nonsensical results, such as a massive loop running in 2 microseconds, or multiplication being the same speed as division (if only that were true).

After I finally worked hard enough to defeat enough of the compiler optimisations, while still letting it optimise for speed, I got these speed results. They may be of interest to someone else.

If my test is STILL FLAWED, let me know, but be kind seeing as I just spent two hours writing this crap :P

64 time: 3826718 us
32 time: 2476484 us
D(mul) time: 936524 us
D(div) time: 3614857 us
S time: 1506020 us

"Multiplying to divide" using doubles seems the fastest way to do a division, followed by integer division. I did not test the accuracy of division. Could it be that "proper division" is more accurate? I have no desire to find out after these speed test results as I'll just be using integer division on a base 10 constant and letting my compiler optimise it for me ;) (and not defeating it's optimisations either).

Here's the code I used to get the results:

#include <cstdint>  // uint32_t, uint64_t
#include <cstdio>   // printf
#include <ctime>    // clock

int Run(int bla, int div, int add, int minus) {
 // these parameters are to force the compiler to not be able to optimise away the
 // multiplications and divides :)
 long LoopMax = 100000000;

 uint32_t Origbla32 = 1000000000;
 long i = 0;

 uint32_t bla32 = Origbla32;
 uint32_t div32 = div;
 clock_t Time32 = clock();
 for (i = 0; i < LoopMax; i++) {
  div32 += add;
  div32 -= minus;
  bla32 = bla32 / div32;
  bla32 += bla;
  bla32 = bla32 * div32;
 }
 Time32 = clock() - Time32;

 uint64_t bla64 = bla32;
 clock_t Time64 = clock();
 uint64_t div64 = div;
 for (long i = 0; i < LoopMax; i++) {
  div64 += add;
  div64 -= minus;
  bla64 = bla64 / div64;
  bla64 += bla;
  bla64 = bla64 * div64;
 }
 Time64 = clock() - Time64;

 double blaDMul = Origbla32;
 double multodiv = 1.0 / (double)div;
 double multomul = div;
 clock_t TimeDMul = clock();
 for (i = 0; i < LoopMax; i++) {
  multodiv += add;
  multomul -= minus;
  blaDMul = blaDMul * multodiv;
  blaDMul += bla;
  blaDMul = blaDMul * multomul;
 }
 TimeDMul = clock() - TimeDMul;

 double blaDDiv = Origbla32;
 clock_t TimeDDiv = clock();
 for (i = 0; i < LoopMax; i++) {
  multodiv += add;
  multomul -= minus;
  blaDDiv = blaDDiv / multomul;
  blaDDiv += bla;
  blaDDiv = blaDDiv / multodiv;
 }
 TimeDDiv = clock() - TimeDDiv;

 float blaS = Origbla32;
 float divS = div;
 clock_t TimeS = clock();
 for (i = 0; i < LoopMax; i++) {
  divS += add;
  divS -= minus;
  blaS = blaS / divS;
  blaS += bla;
  blaS = blaS * divS;
 }
 TimeS = clock() - TimeS;

 printf("64 time: %i us  (%i)\n", (int)Time64, (int)bla64);
 printf("32 time: %i us  (%i)\n", (int)Time32, bla32);

 printf("D(mul) time: %i us  (%f)\n", (int)TimeDMul, blaDMul);
 printf("D(div) time: %i us  (%f)\n", (int)TimeDDiv, blaDDiv);
 printf("S time: %i us  (%f)\n", (int)TimeS, blaS);

 return 0;
}

int main(int argc, char* const argv[]) {
 Run(0, 10, 0, 0); // adds and minuses 0 so it doesn't affect the math, only kills the opts
 return 0;
}
+7  A: 

There are lots of ways to perform certain arithmetic, so there might not be a single answer (shifting, fractional multiplication, actual division, some round-trip through a logarithm unit, etc; these might all have different relative costs depending on the operands and resource allocation).
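For instance, an unsigned divide by a power of two is just a shift, while the signed case is not, and that's exactly the kind of decision the compiler is best placed to make (an illustrative sketch, not code from the question):

unsigned HalfU(unsigned x) { return x >> 1; } // same result as x / 2
int      HalfS(int x)      { return x / 2; }  // NOT the same as x >> 1 when x is negative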

Let the compiler do its thing with the program and data flow information it has.

For some data applicable to assembly on x86, you might look at: "Instruction latencies and throughput for AMD and Intel x86 processors"

Joe Koberg
Well played, sir.
Marcin
"Defeating" the optimizer is about preventing dead-code elimination, because we're interested in a side-effect the optimizer is designed to avoid: time. The same thing can happen in the much simpler case of trying to see how fast this loop runs: `for (int i = 1; i <= 1000000; ++i) {}`. An optimizer is allowed to eliminate that loop entirely, as far as the language is concerned.
Roger Pate
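A minimal sketch of one way around that: give the loop a side effect the optimiser must preserve. The volatile sink below is illustrative, not code from this thread.

#include <cstdio>
#include <ctime>

volatile unsigned sink; // a write the optimiser is not allowed to discard

int main() {
 unsigned acc = 1;
 clock_t t0 = clock();
 for (int i = 1; i <= 1000000; ++i)
  acc = acc * 31u + i; // each iteration depends on the previous one
 sink = acc;           // observable side effect keeps the loop alive
 printf("%ld ticks\n", (long)(clock() - t0));
 return 0;
}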
R. Pate... indeed.
boytheo
True points. The benchmark should be in assembly. But the compiler is still doing a lot more - like deciding to do a shift to divide by 2. What if the runtime code says - "check LSB, if 0 shift, if not perform normal integer multiply." Who knows, it might be faster given the functional unit breakdown at that point in the code. Really you're measuring the speed of compiler+processor anytime you benchmark compiled code.
Joe Koberg
Also the micro-benchmark of how fast the core can execute a stream of nothing but DIVs against registers runs into some problems. Perhaps better to actually include code that produces an end-effect, because "in real life" you will have to store/communicate the results of that computation anyway.
Joe Koberg
+4  A: 

What is fastest will depend entirely on the target architecture. It looks here like you're interested only in the platform you happen to be on, which guessing from your execution times seems to be 64-bit x86, either Intel (Core2?) or AMD.

That said, floating-point multiplication by the inverse will be the fastest on many platforms, but is, as you speculate, usually less accurate than a floating-point divide (two roundings instead of one -- whether or not that matters for your usage is a separate question). In general, you are better off re-arranging your algorithm to use fewer divides than you are jumping through hoops to make division as efficient as possible (the fastest division is the one you don't do), and make sure to benchmark before you spend time optimizing at all, as algorithms that bottleneck on division are few and far between.
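To make "fewer divides" concrete, here is a hedged sketch (the names are made up): hoisting a single reciprocal out of a loop trades n divides for one divide plus n multiplies, at the cost of the second rounding mentioned above.

void ScaleAll(double* v, int n, double d) {
 const double inv = 1.0 / d; // one divide, one rounding
 for (int i = 0; i < n; ++i)
  v[i] *= inv; // the second rounding happens here, once per element
}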

Also, if you have integer sources and need an integer result, make sure to include the cost of conversion between integer and floating-point in your benchmarking.

Since you're interested in timings on a specific machine, you should be aware that Intel now publishes this information in their Optimization Reference Manual (pdf). Specifically, you will be interested in the tables of Appendix C section 3.1, "Latency and Throughput with Register Operands".

Be aware that integer divide timings depend strongly on the actual values involved. Based on the information in that guide, it seems that your timing routines still have a fair bit of overhead, as the performance ratios you measure don't match up with Intel's published information.

Stephen Canon
The question about "why" I did this... it was partly learning, or confirming what I suspected. I think the main lesson I "re-learnt" is that there isn't much use in optimising certain things :) I certainly won't be "micro-optimising" any code on doubles soon; I'll be coding in the most simple/natural way for the task at hand. But having a feel for the speed differences may help in the future. It's always good to know how things work.
boytheo
Oh, absolutely. I'm certainly the last person who would want to come off as an anti-microoptimization "just let the compiler do its job" zealot. I'll write up some notes on how you can get more accurate measurements for this particular benchmark, if you're interested.
Stephen Canon
Didn't realize I had come off quite so zealously! Anyway I am all in favor of a detailed understanding of how the hardware works. It might be interesting to examine the assembly output of the compiler as well...
Joe Koberg
@Joe -- the comment wasn't intended to refer to you =). There's just a certain population on SO who seem to believe that this type of thing is *never* worth looking at, and that the compiler will *always* do better than you. Suffice to say, they're wrong.
Stephen Canon
+1  A: 

As Stephen mentioned, use the optimisation manual - but you should also consider using SSE instructions. These can do two or four divisions/multiplications (packed doubles or floats) in a single instruction; see the sketch below.
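A minimal illustration, assuming SSE support and n divisible by 4 (the function name is made up):

#include <xmmintrin.h>

void DivideAllBy(float* v, int n, float d) {
 const __m128 vd = _mm_set1_ps(d); // broadcast the divisor into all four lanes
 for (int i = 0; i < n; i += 4) {
  __m128 x = _mm_loadu_ps(v + i);          // load four floats
  _mm_storeu_ps(v + i, _mm_div_ps(x, vd)); // four divisions in one instruction
 }
}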

Also, it is fairly common for a division to take a single clock cycle to process. The result may not be available for several clock cycles (called latency), however the next division can begin during this time (overlapping with the first) as long as it does not require the result from the first. This is due to pipe-lining in the CPU, in the same way as you can wash more clothes while the previous load is still drying.
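To illustrate the latency/throughput distinction, a sketch with made-up names (how much overlap you actually get depends on how pipelined the divider is):

// One long dependent chain: every divide must wait for the previous result.
double ChainDiv(double x, double d, int n) {
 for (int i = 0; i < n; ++i)
  x /= d;
 return x;
}

// Four independent chains: the hardware is free to overlap these divides.
double SplitDiv(double a, double b, double c, double e, double d, int n) {
 for (int i = 0; i < n; i += 4) {
  a /= d; b /= d; c /= d; e /= d;
 }
 return a + b + c + e;
}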

Multiplying to divide is a common trick, and should be used wherever your divisor changes infrequently.

There is a very good chance that you will spend time and effort making the maths fast, only to discover that it is the speed of memory access (as you navigate the input and write the output) that limits your final implementation.

Tom Leys
Be aware that the throughput of division (FP or integer) on Core/Core2/i7 isn't anywhere close to single cycle; multiplication by a precomputed inverse does have single-cycle throughput, however.
Stephen Canon
A: 

I wrote a flawed test to do this on MSVC 2008

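// NB: GetTime() and DebugOutput() are the poster's own timing/logging helpers (not shown).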
double i32Time  = GetTime();
{
 volatile __int32 i = 4;
 __int32 count = 0;
 __int32 max  = 1000000;
 while( count < max )
 {
  i /= 61;
  count++;
 }
}
i32Time = GetTime() - i32Time;

double i64Time = GetTime();
{
 volatile __int64 i = 4;
 __int32 count = 0;
 __int32 max  = 1000000;
 while( count < max )
 {
  i /= 61;
  count++;
 }
}
i64Time = GetTime() - i64Time;


double fTime = GetTime();
{
 volatile float i = 4;
 __int32 count = 0;
 __int32 max  = 1000000;
 while( count < max )
 {
  i /= 4.0f;
  count++;
 }
}
fTime = GetTime() - fTime;

double fmTime = GetTime();
{
 volatile float i = 4;
 const float div = 1.0f / 4.0f;
 __int32 count = 0;
 __int32 max  = 1000000;
 while( count < max )
 {
  i *= div;
  count++;
 }
}
fmTime = GetTime() - fmTime;

double dTime = GetTime();
{
 volatile double i = 4;
 __int32 count = 0;
 __int32 max  = 1000000;
 while( count < max )
 {
  i /= 4.0f;
  count++;
 }
}
dTime = GetTime() - dTime;

double dmTime = GetTime();
{
 volatile double i = 4;
 const double div = 1.0f / 4.0f;
 __int32 count = 0;
 __int32 max  = 1000000;
 while( count < max )
 {
  i *= div;
  count++;
 }
}
dmTime = GetTime() - dmTime;


DebugOutput( _T( "%f\n" ), i32Time );
DebugOutput( _T( "%f\n" ), i64Time );
DebugOutput( _T( "%f\n" ), fTime );
DebugOutput( _T( "%f\n" ), fmTime );
DebugOutput( _T( "%f\n" ), dTime );
DebugOutput( _T( "%f\n" ), dmTime );

DebugBreak();

I then ran it on an AMD Turion 64 in 32-bit mode. The results I got were as follows:

0.006622
0.054654
0.006283
0.006353
0.006203
0.006161

The reason the test is flawed is the use of volatile, which forces the compiler to re-load the variable from memory just in case it has changed. All in all, it shows there is precious little difference between any of the implementations on this machine (__int64 is obviously slow).

It also categorically shows that the MSVC compiler performs the multiply-by-reciprocal optimisation. I imagine GCC does the same, if not better. If I change the float and double division checks to divide by "i", the time increases significantly. Though a lot of that could be the re-loading from memory, it is obvious the compiler can't optimise that away so easily.

To understand such micro-optimisations try reading this pdf.

All in all, I'd argue that if you are worrying about such things you obviously haven't profiled your code. Profile, and fix the problems as and when they actually ARE a problem.

Goz
A: 

Agner Fog has done some pretty detailed measurements himself, which can be found here. If you're really trying to optimize stuff, you should read the rest of the documents from his software optimization resources as well.

I would point out that, even if you are measuring non-vectorized floating-point operations, the compiler has two options for the generated assembly: it can use the FPU instructions (fadd, fmul) or it can use SSE instructions while still manipulating one floating-point value per instruction (addss, mulss). In my experience the SSE instructions are faster and have fewer inaccuracies, but compilers don't make it the default because it could break compatibility with code that relies on the old behaviour. You can turn it on in gcc with the -mfpmath=sse flag.
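For example, something like this (the file name is a placeholder; -msse2 makes the instructions available on 32-bit targets):

g++ -O2 -msse2 -mfpmath=sse bench.cpp -o bench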

int3
SSE is targeted by default by the OS X compilers. As he's on a Snow Leopard machine, there's no need for the `-mfpmath` flag.
Stephen Canon
Never knew that, thanks for the info
int3