views:

75

answers:

5

fma(a,b,c) is equivalent to a*b+c, except that it doesn't round the intermediate result.

Could you give me some examples of algorithms that non-trivially benefit from avoiding this rounding?

It's not obvious, as the rounding after multiplication (which we avoid) tends to be less problematic than the rounding after addition (which we don't).
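
(For concreteness, a minimal C99 sketch of the difference in question, assuming round-to-nearest; the values are chosen so the single rounding of fma is visible.)

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        /* a*b equals 1 - 2^-104 exactly; that product rounds to 1.0 in
           double precision, so the plain expression loses the residual,
           while fma rounds only the final result and keeps it.
           Compile with contraction disabled (e.g. gcc -ffp-contract=off)
           so the first expression is not itself turned into an fma. */
        double a = 1.0 + 0x1p-52;
        double b = 1.0 - 0x1p-52;
        printf("a*b - 1.0     = %a\n", a * b - 1.0);     /* 0x0p+0    */
        printf("fma(a,b,-1.0) = %a\n", fma(a, b, -1.0)); /* -0x1p-104 */
        return 0;
    }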

A: 

Off the top of my head: matrix multiplication, Newton's method, polynomial evaluation, numerical methods in general.

WOPR
+2  A: 

The primary benefit of FMA is that it can be twice as fast. Rather than take 1 cycle for the multiply and then 1 cycle for the add, the FPU can issue both operations in the same cycle. Obviously, most algorithms will benefit from faster operations.

Gabe
The question is about the impact of rounding, not about speed. Your answer is also incorrect, as fma requires a 3-input floating point unit instead of the standard 2-input one, an extra port in the floating point register file, and wider floating point adders. This isn't free; it's a trade-off of fma support at the cost of some other hardware.
taw
taw: You asked what algorithms benefit from FMA, and for some examples where the rounding is a non-trivial benefit. I answered the first part: most algorithms will benefit.
Gabe
+1  A: 

Some examples: Vector dot products. Fourier transforms. Digital signal processing. Polynomials. All sorts of things.

It's a question of optimization and hardware exploitation more than anything else. A sum of products is a very common requirement in numerical methods, and this gives you a way to tell the compiler explicitly to do that fast, and perhaps with a little more precision. Unless I'm mistaken, the compiler is free to replace a=b*c+d with an FMA instruction, but it's also free not to (unless the standard calls for rounding, but real-world compilers routinely violate standards in small ways).
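
(As an illustration of that "explicit instruction" point, a C99 sketch: the first loop leaves the choice to the compiler's contraction rules, the second asks for one fused operation per element by calling fma directly.)

    #include <math.h>
    #include <stddef.h>

    /* Plain form: the compiler may contract x[i]*y[i]+sum into an fma,
       or may not, depending on flags and FP_CONTRACT settings. */
    double dot_plain(const double *x, const double *y, size_t n) {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum = x[i] * y[i] + sum;
        return sum;
    }

    /* Explicit form: each step is one fused multiply-add, rounded once. */
    double dot_fma(const double *x, const double *y, size_t n) {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum = fma(x[i], y[i], sum);
        return sum;
    }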

Ian
+1  A: 

The only thing I've found so far is "error-free transformations". For any floating point numbers, the rounding errors of a+b, a-b, and a*b are themselves exactly representable floating point numbers (in round-to-nearest mode, assuming no overflow/underflow etc.).

The addition (and obviously subtraction) error is easy to compute: if abs(a) >= abs(b), the error is exactly b-((a+b)-a) (2 extra flops, or 4-5 if we don't know which operand is bigger). The multiplication error is trivial to compute with fma: it is simply fma(a,b,-a*b). Without fma it takes about 16 flops of rather nasty code, and fully generic emulation of correctly rounded fma is even slower than that.
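
(A sketch of those two error-free transformations in C99; the names two_sum and two_prod_fma are made up here, and round-to-nearest with no overflow/underflow is assumed.)

    #include <math.h>

    /* Error-free sum: s + e == a + b exactly (Knuth's TwoSum, 6 flops
       in total, no assumption about which operand is larger). */
    static void two_sum(double a, double b, double *s, double *e) {
        *s = a + b;
        double bv = *s - a;
        *e = (a - (*s - bv)) + (b - bv);
    }

    /* Error-free product using fma: p + e == a * b exactly. */
    static void two_prod_fma(double a, double b, double *p, double *e) {
        *p = a * b;
        *e = fma(a, b, -*p);
    }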

An extra 16 flops of error tracking per flop of real computation would be huge overkill, but with just 1-5 pipeline-friendly flops it's quite reasonable. For many algorithms built on this, a 50%-200% overhead of error tracking and compensation gives a result as accurate as if all calculations had been done with twice as many bits, avoiding ill-conditioning in many cases.

Interestingly, fma isn't ever used in these algorithms to compute results, just to find errors, because finding the error of an fma is as slow as finding the error of a multiplication was without fma.

Relevant keywords to search for would be "compensated Horner scheme" and "compensated dot product", with the Horner scheme benefiting a lot more.
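
(A rough sketch of a compensated Horner evaluation in that spirit, reusing the two_sum/two_prod_fma helpers sketched above; a real implementation would follow the Graillat/Langlois/Louvet papers.)

    /* Compensated Horner: evaluates a[n]*x^n + ... + a[1]*x + a[0]
       roughly as accurately as if computed in twice the working
       precision, by accumulating the per-step rounding errors. */
    double horner_compensated(const double *a, int n, double x) {
        double s = a[n];
        double c = 0.0;                    /* running compensation term */
        for (int i = n - 1; i >= 0; i--) {
            double p, ep, es;
            two_prod_fma(s, x, &p, &ep);   /* p + ep == s*x exactly      */
            two_sum(p, a[i], &s, &es);     /* s + es == p + a[i] exactly */
            c = c * x + (ep + es);         /* propagate rounding errors  */
        }
        return s + c;
    }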

taw
+1  A: 

taw hit on one important example; more generally, FMA allows library writers to efficiently implement many other floating-point operations with correct rounding.

For example, a platform that has an FMA can use it to implement correctly rounded divide and square root (PPC and Itanium took this approach), which lets the FPU be basically a single-purpose FMA machine. Peter Tang and John Harrison (Intel), and Peter Markstein (HP) have some papers that explain this use if you're curious.
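
(A hedged sketch of the core refinement step in that style of divide, in C99; this is just the part where fma matters, not a complete correctly rounded division, which needs the care described in Markstein's papers.)

    #include <math.h>

    /* One Newton/Markstein-style refinement step for q ~ a/b, given
       y ~ 1/b.  The point is that fma(-b, q, a) computes the residual
       a - b*q with a single rounding, so the correction below isn't
       swamped by the cancellation a plain (a - b*q) would suffer. */
    static double refine_quotient(double a, double b, double q, double y) {
        double r = fma(-b, q, a);   /* residual a - b*q, one rounding */
        return fma(r, y, q);        /* q + r*(1/b), one rounding      */
    }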

The example taw gave is more broadly useful than just in tracking error bounds. It allows you to represent the product of two floating point numbers as a sum of two floating point numbers without any rounding error; this is quite useful in implementing correctly-rounded floating-point library functions. Jean-Michel Muller's book or the papers on crlibm would be good starting places to learn more about these uses.

FMA is also broadly useful in argument reduction in math-library style routines for certain types of arguments; when one is doing argument reduction, the goal of the computation is often a term of the form (x - a*b), where (a*b) is very nearly equal to x itself; in particular, the result is often on the order of the rounding error in the (a*b) term, if this is computed without an FMA. I believe that Muller has also written some about this in his book.
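
(A sketch of that argument-reduction pattern in C99; PI_2_HI is just the double nearest pi/2 here, and a production reduction would use a multi-term constant and Payne-Hanek reduction for huge arguments.)

    #include <math.h>

    #define PI_2_HI 1.5707963267948966  /* double nearest pi/2 */

    /* Reduce x toward roughly [-pi/4, pi/4] by subtracting n*(pi/2).
       n*PI_2_HI is nearly equal to x, so the plain x - n*PI_2_HI would
       first round n*PI_2_HI and then cancel, leaving mostly rounding
       error; fma does the multiply and subtract with a single rounding
       of the final, small result. */
    static double reduce_pio2(double x, double *n_out) {
        double n = nearbyint(x / PI_2_HI);
        *n_out = n;
        return fma(-n, PI_2_HI, x);
    }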

Stephen Canon