Hello.

I have run into a curious problem. The algorithm I am working on consists of lots of computations like this:

q = x(0)*y(0)*z(0) + x(1)*y(1)*z(1) + ...

where the length of the summation is between 4 and 7.

The original computations are all done using 64-bit precision. As an experiment, I tried using 32-bit precision for the x, y, z input values (so that the computations are performed in 32-bit), storing the final result as a 64-bit value (a straightforward cast).
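For reference, the two variants look roughly like this (an illustrative sketch; the real code accesses elements through macros rather than plain arrays):

double dot64(const double* x, const double* y, const double* z, int n) {
  double q = 0.0;
  for (int i = 0; i < n; i++)
    q += x[i] * y[i] * z[i];  // all arithmetic in 64-bit
  return q;
}

double dot32(const float* x, const float* y, const float* z, int n) {
  float q = 0.0f;
  for (int i = 0; i < n; i++)
    q += x[i] * y[i] * z[i];  // all arithmetic in 32-bit
  return (double)q;           // final result stored as 64-bit
}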

I expected the 32-bit version to be faster (cache footprint, SIMD width, etc.), but to my surprise there was no difference in performance, maybe even a slight decrease.

The architecture in question is Intel 64, Linux, g++. Both versions do seem to use SSE, and the arrays in both cases are aligned to a 16-byte boundary.

Why would this be so? My guess so far is that 32-bit precision can use SSE only on the first 4 elements, with the rest being done serially, compounded by the cast overhead.

Thank you.

+22  A: 

On x87 at least, everything is really done in 80-bit precision internally. The precision really just determines how many of those bits are stored in memory. This is part of the reason why different optimization settings can change results slightly: They change the amount of rounding from 80-bit to 32- or 64-bit.

In practice, using 80-bit floating point (long double in C and C++, real in D) is usually slow because there's no efficient way to load and store 80 bits from memory. 32- and 64-bit are usually equally fast provided that memory bandwidth isn't the bottleneck, i.e. if everything is in cache anyhow. 64-bit can be slower if either of the following happens:

  1. Memory bandwidth is the bottleneck.
  2. The 64-bit numbers aren't properly aligned on 8-byte boundaries. 32-bit numbers only require 4-byte alignment for optimal efficiency, so they're less finicky. Some compilers (the Digital Mars D compiler comes to mind) don't always get this right for 64-bit doubles stored on the stack. This causes twice as many memory operations to be needed to load one, in practice resulting in about a 2x performance hit compared to properly aligned 64-bit floats or 32-bit floats. (One way to guarantee heap alignment is sketched after this list.)
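If misalignment is the suspect, a minimal sketch of forcing 16-byte alignment on the heap (assuming a POSIX system, which matches the Linux/g++ setup in the question):

#include <cstdlib>

int main() {
  void* p = 0;
  // 16-byte alignment satisfies SSE loads and keeps doubles on
  // 8-byte boundaries; posix_memalign reports failure via its return value.
  if (posix_memalign(&p, 16, 1024 * sizeof(double)) != 0)
    return 1;
  double* xd = static_cast<double*>(p);
  xd[0] = 1.0;  // ... use the aligned array ...
  free(p);
  return 0;
}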

As far as SIMD optimizations go, it should be noted that most compilers are horrible at auto-vectorizing code. If you don't want to write directly in assembly language, the best way to take advantage of these instructions is to use things like array-wise operations, which are available, for example, in D, and implemented in terms of SSE instructions. Similarly, in C or C++, you would probably want to use a high level library of functions that are SSE-optimized, though I don't know of a good one off the top of my head because I mostly program in D.
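For completeness, here is roughly what hand-vectorizing the 32-bit kernel with SSE intrinsics looks like (a sketch only; it assumes 16-byte-aligned inputs and a length that is a multiple of 4, which the 4-to-7-element sums in the question would need padding to satisfy):

#include <xmmintrin.h>

float dot3_sse(const float* x, const float* y, const float* z, int n) {
  __m128 acc = _mm_setzero_ps();
  for (int i = 0; i < n; i += 4) {
    // (x[i]*y[i])*z[i] for four lanes at a time
    __m128 xy = _mm_mul_ps(_mm_load_ps(x + i), _mm_load_ps(y + i));
    acc = _mm_add_ps(acc, _mm_mul_ps(xy, _mm_load_ps(z + i)));
  }
  float t[4];
  _mm_storeu_ps(t, acc);
  return t[0] + t[1] + t[2] + t[3];  // horizontal sum of the four partials
}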

dsimcha
"x87" - Slightly better than those old x86 processors. :-)
Thanatos
http://en.wikipedia.org/wiki/X87
advs89
A: 

It's probably because your processor still performs the arithmetic at 64-bit precision and then trims the number. There was some CPU flag you could change, but I can't remember which...
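The flag being half-remembered here is probably the precision-control field of the x87 FPU control word; a sketch using glibc's fpu_control.h (this affects x87 arithmetic only, not SSE):

#include <fpu_control.h>

void set_x87_single_precision() {
  fpu_control_t cw;
  _FPU_GETCW(cw);
  // Clear the precision-control bits, then select single-precision rounding.
  cw = (cw & ~_FPU_EXTENDED) | _FPU_SINGLE;
  _FPU_SETCW(cw);
}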

Alistra
A: 

First check the ASM that gets produced. It may not be what you expect.
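For example (file names are placeholders):

g++ -O2 -S code.cpp -o code.s   # emit assembly for inspection
objdump -d ./a.out | less       # or disassemble the built binary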

Also try writing it as a loop:

typedef float fp;
fp q = 0;
for (int i = 0; i < N; i++)
  q += x[i] * y[i] * z[i];

Some compilers might recognize the loop where they would miss the unrolled form.
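Note that gcc will usually refuse to vectorize a floating-point reduction like this at all, because reassociating the sum changes rounding; flags along these lines relax that, at the cost of strict IEEE semantics:

g++ -O3 -ffast-math code.cpp   # -O3 enables the vectorizer; -ffast-math permits reassociation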

Lastly, your code uses () rather than []. If your code is making lots of function calls (12 to 21 of them per sum), that will swamp the FP cost, and even removing the FP computation altogether won't make much difference. Inlining, OTOH, might.

BCS
thanks, actually the `q()` accessors are macros that convert directly to raw pointer access
aaa
@aaa: Well, if there is any math at all, it might still be a large percentage. Also, I don't know how well compilers deal with mixing FP and other stuff. That might be enough to block them from using vector ops.
BCS