views: 1767
answers: 16
We're doing a great deal of floating-point to integer number conversions in our project. Basically, something like this

for(int i = 0; i < HUGE_NUMBER; i++)
     int_array[i] = float_array[i];

The default C conversion which performs the cast turns out to be quite time-consuming.

Is there any workaround (maybe a hand-tuned function) which can speed up the process a little bit? We don't care much about precision.

+1  A: 

Is the time large enough that it outweighs the cost of starting a couple of threads?

Assuming you have a multi-core processor or multiple processors on your box that you could take advantage of, this would be a trivial task to parallelize across multiple threads.
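A minimal sketch of how the split might look with C++11's std::thread (the function and variable names here are illustrative, not from the original post):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Convert one contiguous slice of the array.
void convert_range(int *dst, const float *src, std::size_t begin, std::size_t end)
{
    for (std::size_t i = begin; i < end; ++i)
        dst[i] = static_cast<int>(src[i]);
}

// Split the work into roughly equal chunks, one per thread.
void convert_parallel(int *dst, const float *src, std::size_t n, unsigned nthreads)
{
    std::vector<std::thread> workers;
    std::size_t chunk = n / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = (t + 1 == nthreads) ? n : begin + chunk;
        workers.emplace_back(convert_range, dst, src, begin, end);
    }
    for (auto &w : workers)
        w.join();
}
```

Since the slices don't overlap, no synchronization beyond the final join is needed.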

Martin York
Your intention is fine. Only... consider that with 4 cores, each is still repeatedly hiccuping its x87 pipeline because of the float->int conversion... such a waste! :) The SSE3 fix to the original silicon issue is the right thing (or the magic-number addition trick, if SSE3 cannot be guaranteed).
akauppi
+2  A: 

You might be able to load all of the floats into the SSE unit of your processor using some magic assembly code, do the equivalent conversion to ints there, then store the results back out. I'm not sure this would be any faster though. I'm not an SSE guru, so I don't know how to do this. Maybe someone else can chime in.

FryGuy
SSE (or, if you're cross-platform, AltiVec or NEON) will give you roughly the same speed as a memcpy. If bulk conversion is a problem, the two-liner in assembly or intrinsic-based C is well worth the work.
Nils Pipenbrinck
+1  A: 

There's an FISTTP instruction in the SSE3 instruction set which does what you want, but as to whether or not it could be utilized and produce faster results than libc, I have no idea.

Matt Schmidt
I think FISTTP will _automatically_ enhance the speed of the crippled casts '(int)float_val' via recompilation if:
- SSE3 support is enabled ('-msse3' for gcc)
- The CPU is SSE3 capable
While the 'fix' is tied to the SSE3 feature set, it is actually an x87-side feature.
akauppi
http://software.intel.com/en-us/articles/how-to-implement-the-fisttp-streaming-simd-extensions-3-instruction/
akauppi
A: 

The C compiler doesn't use any "function", rather the conversion is done by loading the numbers into the FPU and then storing them, so unless you use inline assembly with some sort of SIMD instructions like FryGuy said, you probably won't get any faster.

thirtyseven
It's actually more complicated than that. The C standard says float-to-int casts must truncate, but the FPU converts by rounding or truncating depending on its mode, so to be compliant, the compiler has to set the conversion mode.
Sanjaya R
A: 

What compiler are you using? In Microsoft's more recent C/C++ compilers, there is an option under C/C++ -> Code Generation -> Floating point model, which has options: fast, precise, strict. I think precise is the default, and works by emulating FP operations to some extent. If you are using a MS compiler, how is this option set? Does it help to set it to "fast"? In any case, what does the disassembly look like?

As thirtyseven said above, the CPU can convert float<->int in essentially one instruction, and it doesn't get any faster than that (short of a SIMD operation).

Also note that modern CPUs use the same FP unit for both single (32 bit) and double (64 bit) FP numbers, so unless you are trying to save memory storing a lot of floats, there's really no reason to favor float over double.

Chris Smith
That Visual Studio setting exists because reordering floating point math can produce slightly different results, even if it shouldn't mathematically, such as "a * (b + c)" vs "a*b + a*c".
Shmoopty
See http://software.intel.com/en-us/articles/how-to-implement-the-fisttp-streaming-simd-extensions-3-instruction/ for how many instructions it takes to convert float->int (on X87).
akauppi
+6  A: 
inline int float2int( double d )
{
   union Cast
   {
      double d;
      long l;  // assumes 32-bit long, little-endian, IEEE-754 doubles
   };
   volatile Cast c;
   c.d = d + 6755399441055744.0;  // 2^52 + 2^51: forces the rounded
   return c.l;                    // integer into the low 32 bits
}

// this is the same thing but it's
// not always optimizer safe (it breaks strict aliasing)
inline int float2int( double d )
{
   d += 6755399441055744.0;
   return reinterpret_cast<int&>(d);
}

for(int i = 0; i < HUGE_NUMBER; i++)
     int_array[i] = float2int(float_array[i]);

The double parameter is not a mistake! There is a way to do this trick with floats directly, but it gets ugly trying to cover all the corner cases. In its current form this function will round the float to the nearest whole number; if you want truncation instead, use 6755399441055743.5 (0.5 less).
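A self-contained variant of the trick using memcpy instead of the union, which sidesteps strict aliasing (still assumes IEEE-754 doubles and a little-endian machine; the helper name is made up for illustration):

```cpp
#include <cstdint>
#include <cstring>

// Round-to-nearest float->int via the magic-number trick.
// 6755399441055744.0 == 2^52 + 2^51; adding it leaves the rounded
// integer sitting in the low 32 bits of the double's bit pattern.
inline int magic_float2int(double d)
{
    d += 6755399441055744.0;
    std::int32_t lo;
    std::memcpy(&lo, &d, sizeof(lo));  // low word on little-endian
    return lo;
}
```

Note that ties round to even (the FPU's default mode), so -2.5 becomes -2, not -3.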

caspin
I don't think this does what you're expecting. The value in l will be the same bit pattern as in d, but it won't be anything similar to the same number: 6.054 != -9620726 (my machine, 32 bit little-endian).
Max Lybbert
@Max: The expectation is a 32-bit "long" type (and, of course, an IEEE-754 double). Given these, this works, although I doubt it could be any faster than "movsd xmm0, mmword ptr [d]; cvttsd2si eax, xmm0; mov dword ptr [i], eax" (which is what my compiler generates for the straight cast).
P Daddy
I learned this trick from the Lua source code. There are some places where it doesn't work, but I've never found one. It works fine on my core2duo and my old pentium.
caspin
This seems like it would have a very low chance of working if the machine isn't exactly what you expect. Also I can't believe it would be better than other methods mentioned here.
Jay Conrod
That's scary. I assume this is only valid for a particular type of floating-point number (IEEE-754?). I think you should make this explicit in your answer (unless it is true everywhere) and also note that C++ does not specify a particular floating-point standard, so you should verify before use.
Martin York
A: 

On Intel your best bet is inline SSE2 calls.

Sanjaya R
+1  A: 

In Visual C++ 2008, the compiler generates SSE2 calls by itself if you do a release build with maxed-out optimization options and look at the disassembly (though some conditions have to be met; play around with your code).

A: 

I'm surprised by your result. What compiler are you using? Are you compiling with optimization turned all the way up? Have you confirmed using valgrind and Kcachegrind that this is where the bottleneck is? What processor are you using? What does the assembly code look like?

The conversion itself should be compiled to a single instruction. A good optimizing compiler should unroll the loop so that several conversions are done per test-and-branch. If that's not happening, you can unroll the loop by hand:

int i;
for(i = 0; i < HUGE_NUMBER-3; i += 4) {
     int_array[i]   = float_array[i];
     int_array[i+1] = float_array[i+1];
     int_array[i+2] = float_array[i+2];
     int_array[i+3] = float_array[i+3];
}
for(; i < HUGE_NUMBER; i++)
     int_array[i]   = float_array[i];

If your compiler is really pathetic, you might need to help it with the common subexpressions, e.g.,

int *ip = int_array+i;
float *fp = float_array+i;
ip[0] = fp[0];
ip[1] = fp[1];
ip[2] = fp[2];
ip[3] = fp[3];

Do report back with more info!

Norman Ramsey
It's not a single instruction. See: http://software.intel.com/en-us/articles/how-to-implement-the-fisttp-streaming-simd-extensions-3-instruction/ (hi, Norman! Wouldn't have thought of you here... ;)
akauppi
A: 

Throw a Duff's Device at it, too.
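For reference, a Duff's-device version of the conversion loop might look like this (a sketch only; on modern compilers the plain loop often unrolls just as well):

```cpp
#include <cstddef>

// Duff's device: the switch jumps into the middle of the unrolled
// loop body to handle the n % 8 leftover elements first, then falls
// into full eight-at-a-time iterations.
void duff_convert(int *dst, const float *src, std::size_t n)
{
    std::size_t i = 0;
    switch (n % 8) {
        case 0: while (i < n) {
                    dst[i] = static_cast<int>(src[i]); ++i;
        case 7:     dst[i] = static_cast<int>(src[i]); ++i;
        case 6:     dst[i] = static_cast<int>(src[i]); ++i;
        case 5:     dst[i] = static_cast<int>(src[i]); ++i;
        case 4:     dst[i] = static_cast<int>(src[i]); ++i;
        case 3:     dst[i] = static_cast<int>(src[i]); ++i;
        case 2:     dst[i] = static_cast<int>(src[i]); ++i;
        case 1:     dst[i] = static_cast<int>(src[i]); ++i;
                }
    }
}
```

This only reduces loop overhead; it does nothing about the cost of the conversion itself.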

Matt Cruikshank
A: 

See this Intel article for speeding up integer conversions:

http://software.intel.com/en-us/articles/latency-of-floating-point-to-integer-conversions/

According to Microsoft, the /QIfist compiler option is deprecated in VS 2005 because integer conversion has been sped up. They neglect to say how it has been sped up, but looking at the disassembly listing might give a clue.

http://msdn.microsoft.com/en-us/library/z8dh4h17(vs.80).aspx

Mark Ransom
A: 

If you have very large arrays (bigger than a few MB--the size of the CPU cache), time your code and see what the throughput is. You're probably saturating the memory bus, not the FP unit. Look up the maximum theoretical bandwidth for your CPU and see how close to it you are.

If you're being limited by the memory bus, extra threads will just make it worse. You need better hardware (e.g. faster memory, different CPU, different motherboard).


In response to Larry Gritz's comment...

You are correct: the FPU is a major bottleneck (and using the xs_CRoundToInt trick allows one to come very close to saturating the memory bus).

Here are some test results for a Core 2 (Q6600) processor. The theoretical main-memory bandwidth for this machine is 3.2 GB/s (L1 and L2 bandwidths are much higher). The code was compiled with Visual Studio 2008. Similar results for 32-bit and 64-bit, and with /O2 or /Ox optimizations.

WRITING ONLY...
  1866359 ticks with 33554432 array elements (33554432 touched).  Bandwidth: 1.91793 GB/s
  154749 ticks with 262144 array elements (33554432 touched).  Bandwidth: 23.1313 GB/s
  108816 ticks with 8192 array elements (33554432 touched).  Bandwidth: 32.8954 GB/s

USING CASTING...
  5236122 ticks with 33554432 array elements (33554432 touched).  Bandwidth: 0.683625 GB/s
  2014309 ticks with 262144 array elements (33554432 touched).  Bandwidth: 1.77706 GB/s
  1967345 ticks with 8192 array elements (33554432 touched).  Bandwidth: 1.81948 GB/s

USING xs_CRoundToInt...
  1490583 ticks with 33554432 array elements (33554432 touched).  Bandwidth: 2.40144 GB/s
  1079530 ticks with 262144 array elements (33554432 touched).  Bandwidth: 3.31584 GB/s
  1008407 ticks with 8192 array elements (33554432 touched).  Bandwidth: 3.5497 GB/s

(Windows) source code:

// floatToIntTime.cpp : Defines the entry point for the console application.
//

#include <windows.h>
#include <iostream>

using namespace std;

double const _xs_doublemagic = double(6755399441055744.0);
inline int xs_CRoundToInt(double val, double dmr=_xs_doublemagic) { 
  val = val + dmr; 
  return ((int*)&val)[0]; 
} 

static size_t const N = 256*1024*1024/sizeof(double);
int    I[N];
double F[N];
static size_t const L1CACHE = 128*1024/sizeof(double);
static size_t const L2CACHE = 4*1024*1024/sizeof(double);

static size_t const Sz[]    = {N,     L2CACHE/2,     L1CACHE/2};
static size_t const NIter[] = {1, N/(L2CACHE/2), N/(L1CACHE/2)};

int main(int argc, char *argv[])
{
  __int64 freq;
  QueryPerformanceFrequency((LARGE_INTEGER*)&freq);

  cout << "WRITING ONLY..." << endl;
  for (int t=0; t<3; t++) {
    __int64 t0,t1;
    QueryPerformanceCounter((LARGE_INTEGER*)&t0);
    size_t const niter = NIter[t];
    size_t const sz    = Sz[t];
    for (size_t i=0; i<niter; i++) {
      for (size_t n=0; n<sz; n++) {
        I[n] = 13;
      }
    }
    QueryPerformanceCounter((LARGE_INTEGER*)&t1);
    double bandwidth = 8*niter*sz / (((double)(t1-t0))/freq) / 1024/1024/1024;
    cout << "  " << (t1-t0) << " ticks with " << sz 
         << " array elements (" << niter*sz << " touched).  " 
         << "Bandwidth: " << bandwidth << " GB/s" << endl;
  }

  cout << "USING CASTING..." << endl;
  for (int t=0; t<3; t++) {
    __int64 t0,t1;
    QueryPerformanceCounter((LARGE_INTEGER*)&t0);
    size_t const niter = NIter[t];
    size_t const sz    = Sz[t];
    for (size_t i=0; i<niter; i++) {
      for (size_t n=0; n<sz; n++) {
        I[n] = (int)F[n];
      }
    }
    QueryPerformanceCounter((LARGE_INTEGER*)&t1);
    double bandwidth = 8*niter*sz / (((double)(t1-t0))/freq) / 1024/1024/1024;
    cout << "  " << (t1-t0) << " ticks with " << sz 
         << " array elements (" << niter*sz << " touched).  " 
         << "Bandwidth: " << bandwidth << " GB/s" << endl;
  }

  cout << "USING xs_CRoundToInt..." << endl;
  for (int t=0; t<3; t++) {
    __int64 t0,t1;
    QueryPerformanceCounter((LARGE_INTEGER*)&t0);
    size_t const niter = NIter[t];
    size_t const sz    = Sz[t];
    for (size_t i=0; i<niter; i++) {
      for (size_t n=0; n<sz; n++) {
        I[n] = xs_CRoundToInt(F[n]);
      }
    }
    QueryPerformanceCounter((LARGE_INTEGER*)&t1);
    double bandwidth = 8*niter*sz / (((double)(t1-t0))/freq) / 1024/1024/1024;
    cout << "  " << (t1-t0) << " ticks with " << sz 
         << " array elements (" << niter*sz << " touched).  " 
         << "Bandwidth: " << bandwidth << " GB/s" << endl;
  }

  return 0;
}
Mr Fooz
It really isn't that. It's the X87.
akauppi
A: 

Most C compilers generate a call to _ftol or similar for every float-to-int conversion. Setting a reduced floating-point conformance switch (like /fp:fast) might help -- IF you understand AND accept the other effects of this switch. Other than that, put the thing in a tight assembly or SSE-intrinsic loop, IF you are OK with AND understand the different rounding behavior. For large loops like your example, you should write a function that sets up the floating-point control word once, does the bulk rounding with only fistp instructions, and then resets the control word -- IF you are OK with an x86-only code path, but at least you will not change the rounding. Read up on the fld and fistp FPU instructions and the FPU control word.
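A portable sketch of the "set the control word once, then bulk convert" idea, using the standard &lt;cfenv&gt; interface rather than raw fistp assembly (the function name is illustrative):

```cpp
#include <cfenv>
#include <cmath>
#include <cstddef>

// Set the FP rounding mode once, convert the whole array with lrintf
// (which honors the current rounding mode), then restore the old mode.
// This avoids the per-element mode switching that makes the plain
// cast slow on x87.
void bulk_truncate(int *dst, const float *src, std::size_t n)
{
    const int old_mode = std::fegetround();
    std::fesetround(FE_TOWARDZERO);  // mimic C's truncating cast
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = static_cast<int>(std::lrintf(src[i]));
    std::fesetround(old_mode);
}
```

Some compilers need `#pragma STDC FENV_ACCESS ON` (or a flag like -frounding-math) to guarantee they respect the mode change.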

starmole
+5  A: 

Most of the other answers here just try to eliminate loop overhead.

Only Caspin's answer gets to the heart of what is likely the real problem -- that converting floating point to integers is shockingly expensive on an x86 processor. Caspin's solution is correct, though he gives no citation or explanation.

Here is the source of the trick, with some explanation and also versions specific to whether you want to round up, down, or toward zero: Know your FPU

Sorry to provide a link, but really anything written here, short of reproducing that excellent article, is not going to make things clear.

Larry Gritz
+1  A: 

The key is to avoid the _ftol() function, which is needlessly slow. Your best bet for long lists of data like this is to use the SSE2 instruction cvtps2dq, which converts four packed floats to four packed int32s in a single instruction. You don't need assembly to do this; MSVC exposes compiler intrinsics for the relevant instructions -- _mm_cvtps_epi32() (and _mm_cvtpd_epi32() for doubles), if my memory serves me correctly.

If you do this it is very important that your float and int arrays be 16-byte aligned so that the SSE2 load/store intrinsics can work at maximum efficiency. Also, I recommend you software pipeline a little and process sixteen floats at once in each loop, eg (assuming that the "functions" here are actually calls to compiler intrinsics):

for(int i = 0; i < HUGE_NUMBER; i+=16)
{
//int_array[i] = float_array[i];
   __m128 a = sse_load4(float_array+i+0);
   __m128 b = sse_load4(float_array+i+4);
   __m128 c = sse_load4(float_array+i+8);
   __m128 d = sse_load4(float_array+i+12);
   a = sse_convert4(a);
   b = sse_convert4(b);
   c = sse_convert4(c);
   d = sse_convert4(d);
   sse_write4(int_array+i+0, a);
   sse_write4(int_array+i+4, b);
   sse_write4(int_array+i+8, c);
   sse_write4(int_array+i+12, d);
}

The reason for this is that the SSE instructions have a long latency, so if you follow a load into xmm0 immediately with a dependent operation on xmm0 then you will have a stall. Having multiple registers "in flight" at once hides the latency a little. (Theoretically a magic all-knowing compiler could alias its way around this problem but in practice it doesn't.)
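With the placeholder names above replaced by real SSE2 intrinsics, the loop might look like this (a sketch; assumes 16-byte-aligned arrays and a count that is a multiple of 16, and uses _mm_cvtps_epi32, which rounds to nearest -- substitute _mm_cvttps_epi32 for C-style truncation):

```cpp
#include <emmintrin.h>

void sse2_convert(int *int_array, const float *float_array, int n)
{
    for (int i = 0; i < n; i += 16) {
        // Four independent loads keep several registers in flight,
        // hiding the load/convert latency as described above.
        __m128 a = _mm_load_ps(float_array + i + 0);
        __m128 b = _mm_load_ps(float_array + i + 4);
        __m128 c = _mm_load_ps(float_array + i + 8);
        __m128 d = _mm_load_ps(float_array + i + 12);
        // cvtps2dq: four floats -> four int32s per instruction.
        _mm_store_si128(reinterpret_cast<__m128i*>(int_array + i + 0),
                        _mm_cvtps_epi32(a));
        _mm_store_si128(reinterpret_cast<__m128i*>(int_array + i + 4),
                        _mm_cvtps_epi32(b));
        _mm_store_si128(reinterpret_cast<__m128i*>(int_array + i + 8),
                        _mm_cvtps_epi32(c));
        _mm_store_si128(reinterpret_cast<__m128i*>(int_array + i + 12),
                        _mm_cvtps_epi32(d));
    }
}
```

A tail loop with a plain cast would be needed when the count isn't a multiple of 16.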

Failing this SSE juju, you can supply the /QIfist option to MSVC, which will cause it to issue the single opcode fist instead of a call to _ftol; this means it will simply use whichever rounding mode happens to be set in the CPU without making sure it is ANSI C's specific truncate op. The Microsoft docs say /QIfist is deprecated because their floating point code is fast now, but a disassembler will show you that this is unjustifiedly optimistic. Even /fp:fast simply results in a call to _ftol_sse2, which, though faster than the egregious _ftol, is still a function call followed by a latent SSE op, and thus unnecessarily slow.

I'm assuming you're on x86 arch, by the way -- if you're on PPC there are equivalent VMX operations, or you can use the magic-number addition trick mentioned above followed by a vsel (to mask out the non-mantissa bits) and an aligned store.

Crashworks
+1  A: 

I ran some tests on different ways of doing float-to-int conversion. The short answer is to assume your customer has SSE2-capable CPUs and set the /arch:SSE2 compiler flag. This will allow the compiler to use the SSE scalar instructions which are twice as fast as even the magic-number technique.

Otherwise, if you have long strings of floats to grind, use the SSE2 packed ops.

Crashworks