ansaurus

Question

What's the relative speed of floating point add vs. floating point multiply

Answer 1

A:

I can't find a definitive reference, but extensive experimentation tells me that float multiplication nowadays is just about the same speed as addition and subtraction, while division isn't (but not "many times" slower, either). You can get the intuition you desire only by running your own experiments -- remember to generate the random numbers (millions of them) in advance, read them before you start timing, and use the CPU's performance counters (with no other process running, as much as you can stop them from) for accurate measurement!

Alex Martelli 2009-07-18 02:15:45

Answer 2

A:

There is probably very little difference in time between multiplication and addition. division on the other hand is still significantly slower then multiplication because of its recursive nature. on modern x86 architecture sse instructions should be considered when doing floating point operation rather then using the fpu.Though a good C/C++ compiler should give you the option of using sse instead of the fpu.

hacim 2009-07-18 02:21:47

Answer 3

+1 A:

The speed difference of * / vs + - depends on your processor architecture. In general and with x86 in particular the speed difference has become less with modern processors. * should be close to +, when in doubt: just experiment. If you have a really hard problem with lots of FP operations also consider using your GPU (GeForce, ...) which works as a vector processor.

2009-07-18 02:29:05

Answer 4

+4 A:

The best way to answer this question is to actually write a benchmark/profile of the processing you need to do. Empirical should be used over theoretical when ever possible. Especially when it easy to attain.

If you already know different implementations of the Math you need to do, you could write a a few different code transfermations of the math and see where your performance peaks. This will allow the processor/compiler to generate different execution streams to fill the processor pipelines and give you a concrete answer to your answer.

If you are interest in specifically the performance of DIV/MUL/ADD/SUB type instructions you could even toss in some inline assembly to control specifically which variants of these instruction are executed. However you need to make sure you're keeping multilple execution units busy to get a good idea of the performance the system is capable of.

Also doing something like this would allow you to compare performance on multiple variations of the processor by simply running the same program on them, and could also allow you to factor in the motherboard differences.

Edit:

Basic architecture of a +- is identical. So they logically take the same time to compute. * on the other hand, require multiple layers, typically constructed out of "full adders" to complete a single operation. This garentees that while a * can be issued to the pipeline every cycle it will have a higher latency than an add/subtract circuit. A fp / operation is typically implemented using an approximation method which iteratively converges towards the correct answer over time. These types of approximations are typically implemented via multiplication. So for floating point you can generally assume division will take longer because it's impractical to "unroll" the multiplications (which is already a large circuit in and of it's self) into pipeline of a multitude of multiplier circuits. Still the performance of a given system is best measured via testing.

NoMoreZealots 2009-07-18 16:35:13

Answer 5

+5 A:

In theory the information is here:

Intel®64 and IA-32 Architectures Optimization Reference Manual, APPENDIX C INSTRUCTION LATENCY AND THROUGHPUT

For every processor they list, the latency on FMUL is very close to that of FADD or FDIV. On some of the older processors, FDIV is 2-3 time slower than that, while on newer processors, it's the same as FMUL.

Caveats:

The document I linked actually says you can't rely on these numbers in real life since the processor will do what it wishes to make things faster if it's correct.
There's a good chance your compiler will decide to use one of the many newer instruction sets that have a floating-point multiply / divide available.
This is a complicated document only meant to be read by compiler writers and I might have gotten it wrong. Like I'm not clear why the FDIV latency number is completely missing for some of the CPUs.

smackfu 2009-07-18 17:22:22

Very cool document. I think one thing that remains consistent (and this document shows it) is that division is still far slower than multiplication, addition, and subtraction. From the looks of this document, the latency of double-precision division is 10X slower than multiplication. So, for example, I believe calling x = y * 0.5 should be faster than calling x = y / 2.

Steve Wortham 2009-07-20 15:20:16

Answer 6

+6 A:

It also depends on instruction mix. Your processor will have several computation units standing by at any time, and you'll get maximum throughput if all of them are filled all the time. So, executing a loop of mul's is just as fast as executing a loop or adds - but the same doesn't hold if the expression becomes more complex.

For example, take this loop:

for(int j=0;j<NUMITER;j++) {
  for(int i=1;i<NUMEL;i++) {
    bla += 2.1 + arr1[i] + arr2[i] + arr3[i] + arr4[i] ;
  }
}

for NUMITER=10^7, NUMEL=10^2, both arrays initialized to small positive numbers (NaN is much slower), this takes 6.0 seconds using doubles on a 64-bit proc. If I replace the loop with

bla += 2.1 * arr1[i] + arr2[i] + arr3[i] * arr4[i] ;

It only takes 1.7 seconds... so since we "overdid" the additions, the muls were essentially free; and the reduction in additions helped. It get's more confusing:

bla += 2.1 + arr1[i] * arr2[i] + arr3[i] * arr4[i] ;

-- same mul/add distribution, but now the constant is added in rather than multiplied in -- takes 3.7 seconds. Your processor is likely optimized to perform typical numerical computations more efficiently; so dot-product like sums of muls and scaled sums are about as good as it gets; adding constants isn't nearly as common, so that's slower...

bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i] ; /*someval == 2.1*/

again takes 1.7 seconds.

bla += someval + arr1[i] + arr2[i] + arr3[i] + arr4[i] ; /*someval == 2.1*/

(same as initial loop, but without expensive constant addition: 2.1 seconds)

bla += someval * arr1[i] * arr2[i] * arr3[i] * arr4[i] ; /*someval == 2.1*/

(mostly muls, but one addition:1.9 seconds)

So, basically; it's hard to say which is faster, but if you wish to avoid bottlenecks, more important is to have a sane mix, avoid NaN or INF, avoid adding constants. Whatever you do, make sure you test, and test various compiler settings, since often small changes can just make the difference.

Some more cases:

bla *= someval; // someval very near 1.0; takes 2.1 seconds
bla *= arr1[i] ;// arr1[i] all very near 1.0; takes 66(!) seconds
bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i] ; // 1.6 seconds
bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i] ; //32-bit mode, 2.2 seconds
bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i] ; //32-bit mode, floats 2.2 seconds
bla += someval * arr1[i]* arr2[i];// 0.9 in x64, 1.6 in x86
bla += someval * arr1[i];// 0.55 in x64, 0.8 in x86
bla += arr1[i] * arr2[i];// 0.8 in x64, 0.8 in x86, 0.95 in CLR+x64, 0.8 in CLR+x86

Eamon Nerbonne 2009-07-18 18:22:47

Instruction mix is a good point, I have people I work with who insist that a 200 Floating Point DSP is going to out perform a 600 fixed point DSP. They do absolutely no tight loop processing, and spend more time processing I/O than doing calcuations. A faster fixed point processor would win based on the overall instruction mix, but people just think FP units are magic rather than a HW implementation of a datastructure.

NoMoreZealots 2009-07-18 19:23:57

Ah yes, the magic appproach ;-) - that's unfortunate.

Eamon Nerbonne 2009-07-20 15:05:48

nice explanation with intuitive examples!

Sebastian Good 2009-09-11 01:54:29

ansaurus

tags:

views:

answers:

What's the relative speed of floating point add vs. floating point multiply

related questions