views: 1237

answers: 8

I am doing some performance-critical work in C++, and we are currently using integer calculations for problems that are inherently floating point because "it's faster". This causes a whole lot of annoying problems and adds a lot of annoying code.

Now, I remember reading about how floating point calculations were very slow around the 386 days, when I believe (IIRC) there was an optional co-processor. But surely nowadays, with exponentially more complex and powerful CPUs, it makes no difference in "speed" whether I do floating point or integer calculations? Especially since the actual calculation time is tiny compared to something like causing a pipeline stall or fetching something from main memory?

I know the correct answer is to benchmark on the target hardware; what would be a good way to test this? I wrote two tiny C++ programs and compared their run time with "time" on Linux, but the actual run time is too variable (it doesn't help that I am running on a virtual server). Short of spending my entire day running hundreds of benchmarks, making graphs, etc., is there something I can do to get a reasonable test of the relative speed? Any ideas or thoughts? Am I completely wrong?

The programs I used are as follows; they are not identical by any means:

#include <iostream>
#include <cmath>
#include <cstdlib>
#include <time.h>

int main( int argc, char** argv )
{
    int accum = 0;

    srand( time( NULL ) );

    for( unsigned int i = 0; i < 100000000; ++i )
    {
        accum += rand( ) % 365;
    }
    std::cout << accum << std::endl;

    return 0;
}

Program 2:

#include <iostream>
#include <cmath>
#include <cstdlib>
#include <time.h>

int main( int argc, char** argv )
{

    float accum = 0;
    srand( time( NULL ) );

    for( unsigned int i = 0; i < 100000000; ++i )
    {
        accum += (float)( rand( ) % 365 );
    }
    std::cout << accum << std::endl;

    return 0;
}

Thanks in advance!

Edit: The platform I care about is regular x86 or x86-64 running on desktop Linux and Windows machines.

Edit 2 (pasted from a comment below): We have an extensive code base currently. Really I have come up against the generalization that we "must not use float since integer calculation is faster" - and I am looking for a way (if this is even true) to disprove this generalized assumption. I realize that it would be impossible to predict the exact outcome for us short of doing all the work and profiling it afterwards.

Anyway, thanks for all your excellent answers and help. Feel free to add anything else :).

+1  A: 

Unless you're writing code that will be called millions of times per second (e.g., drawing a line to the screen in a graphics application), integer vs. floating-point arithmetic is rarely the bottleneck.

The usual first step for efficiency questions is to profile your code to see where the run time is really spent. The Linux tool for this is gprof (compile and link with -pg, run the program, then inspect the generated gmon.out with gprof).

Edit:

Though I suppose you could always implement the line-drawing algorithm using both integers and floating-point numbers, call it a large number of times, and see if it makes a difference:

http://en.wikipedia.org/wiki/Bresenham's_algorithm
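
For illustration, here is a minimal sketch of the integer-only inner loop for the first octant (0 <= dy <= dx), assuming a hypothetical plot(x, y) callback that writes a pixel:

void plot(int x, int y);  // hypothetical; assumed to write one pixel

// Bresenham line drawing using only integer arithmetic (first octant).
void draw_line(int x0, int y0, int x1, int y1) {
    int dx = x1 - x0;
    int dy = y1 - y0;
    int err = 2 * dy - dx;   // integer decision variable
    int y = y0;
    for (int x = x0; x <= x1; ++x) {
        plot(x, y);
        if (err > 0) {
            ++y;
            err -= 2 * dx;
        }
        err += 2 * dy;
    }
}

The floating-point counterpart (a DDA that keeps y as a float and adds dy/dx each step) is the natural thing to time against it.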

Artem
Scientific applications use FP. The only advantage of FP is that precision is scale-invariant. It's like scientific notation. If you know the scale of the numbers already (e.g., that the line length is a number of pixels), FP is obviated. But before you get to drawing the line, that's not true.
Potatoswatter
+7  A: 

Addition is much faster than rand, so your program mostly measures rand, which makes it (especially) useless.

You need to identify performance hotspots and incrementally modify your program. It sounds like you have problems with your development environment that will need to be solved first. Is it impossible to run your program on your PC for a small problem set?

Generally, attempting FP jobs with integer arithmetic is a recipe for slowness.
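
To make that concrete, here is a minimal sketch of what "doing FP work with integers" usually means in practice: 16.16 fixed-point, where every multiply needs a widened intermediate and a shift back into format (the format and helper names here are hypothetical, purely for illustration):

#include <stdint.h>

// 16.16 fixed-point: a value x is stored as the integer x * 65536.
typedef int32_t fixed;

fixed to_fixed(int x) {
    return (fixed)(x * 65536);
}
fixed fixed_mul(fixed a, fixed b) {
    // Widen to 64 bits so the intermediate product does not overflow,
    // then shift back down into 16.16 format.
    return (fixed)(((int64_t)a * b) >> 16);
}
fixed fixed_div(fixed a, fixed b) {
    // Pre-scale the dividend to keep the fractional bits.
    return (fixed)(((int64_t)a * 65536) / b);
}

Every operation carries this extra scaling work (plus the bookkeeping of remembering where the binary point is), which is exactly the overhead the floating point hardware handles for you.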

Potatoswatter
Yeah, as well as the conversion from a rand integer to a float in the floating point version. Any ideas on a better way to test this?
maxpenguin
If you're trying to profile speed, look at POSIX's `clock_gettime` with `struct timespec`, or something similar. Record the time at the start and end of the loop and take the difference. Then move the `rand` data generation out of the loop. Make sure that your algorithm gets all its data from arrays and puts all its data in arrays. That gets your actual algorithm by itself, and keeps setup, malloc, result printing, everything but task switching and interrupts out of your profiling loop.
Mike D.
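
A minimal harness along those lines might look like this (a sketch; it assumes POSIX clock_gettime with CLOCK_MONOTONIC is available, and the array size is just a placeholder):

#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <time.h>

int main()
{
    const size_t N = 10000000;          // placeholder size
    int* data = new int[N];
    // Generate the input up front so rand() is not part of the measured region.
    for (size_t i = 0; i < N; ++i)
        data[i] = rand() % 365;

    timespec start, stop;
    clock_gettime(CLOCK_MONOTONIC, &start);   // may need -lrt on older glibc

    long long accum = 0;
    for (size_t i = 0; i < N; ++i)
        accum += data[i];               // the code under test goes here

    clock_gettime(CLOCK_MONOTONIC, &stop);
    double seconds = (stop.tv_sec - start.tv_sec)
                   + (stop.tv_nsec - start.tv_nsec) * 1e-9;
    // Print the result so the compiler cannot discard the loop.
    std::printf("sum=%lld time=%f s\n", accum, seconds);

    delete[] data;
    return 0;
}
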
@maxpenguin: the question is what you are testing. Artem has assumed you are doing graphics, Carl considered whether you're on an embedded platform sans FP, I supposed you're coding science for a server. You can't generalize or "write" benchmarks. Benchmarks are sampled from the actual work your program does. One thing I can tell you is that it won't remain "essentially the same speed" if you touch the performance-critical element in your program, whatever that is.
Potatoswatter
@Potatoswatter good point and good answer. We have an extensive code base currently. Really I have come up against the generalization that we "must not use float since integer calculation is faster" - and I am looking for a way (if this is even true) to disprove this generalized assumption. I realize that it would be impossible to predict the exact outcome for us short of doing all the work and profiling it afterwards. Anyway, thanks for your help.
maxpenguin
+1  A: 

Based on that oh-so-reliable "something I've heard", back in the old days integer calculation was about 20 to 50 times faster than floating point, and these days it's less than twice as fast.

James Curran
+5  A: 

Alas, I can only give you an "it depends" answer...

From my experience, there are many, many variables to performance... especially between integer and floating point math. It varies strongly from processor to processor (even within the same family, such as x86) because different processors have different "pipeline" lengths. Also, some operations are generally very simple (such as addition) and have an accelerated route through the processor, while others (such as division) take much, much longer.

The other big variable is where the data resides. If you only have a few values to add, then all of the data can reside in cache, where it can be fed to the CPU quickly. A very, very slow floating point operation that already has its data in cache will be many times faster than an integer operation that needs to copy an integer from system memory.

I assume that you are asking this question because you are working on a performance-critical application. If you are developing for the x86 architecture and you need extra performance, you might want to look into using the SSE extensions. This can greatly speed up single-precision floating point arithmetic, as the same operation can be performed on multiple pieces of data at once, plus there is a separate* bank of registers for the SSE operations. (I noticed in your second example you used "float" instead of "double", making me think you are using single-precision math.)

*Note: Using the old MMX instructions would actually slow down programs, because those old instructions actually used the same registers as the FPU does, making it impossible to use both the FPU and MMX at the same time.

Dan
And on some processors FP math can be faster than integer math. The Alpha processor had an FP divide instruction but not an integer one, so integer division had to be done in software.
Gabe
Great information here, thanks.
maxpenguin
A: 

I ran a test that just added 1 to the number instead of rand(). Results (on an x86-64) were:

  • short: 4.260s
  • int: 4.020s
  • long long: 3.350s
  • float: 7.330s
  • double: 7.210s
dan04
Source, compile options, and timing method? I'm a bit surprised by the results.
GMan
Same loop as OP with "rand( ) % 365" replaced by "1". No optimization. User time from "time" command.
dan04
"No optimization" is the key. You never profile with optimization turned off, always profile in "release" mode.
Dean Harding
In this case, though, turning optimization off forces the op to occur, and that is done deliberately -- the loop is there to dilate time to a reasonable scale of measurement. Using the constant 1 removes the cost of rand(). A sufficiently smart optimizing compiler would see 1 added 100,000,000 times with no way out of the loop and simply add 100,000,000 in a single op. That sort of defeats the whole purpose, doesn't it?
Stan Rogers
+2  A: 

There is likely to be a significant difference in real-world speed between fixed-point and floating-point math, but the theoretical best-case throughput of the ALU vs FPU is completely irrelevant. Instead, the number of integer and floating-point registers (real registers, not register names) on your architecture which are not otherwise used by your computation (e.g. for loop control), the number of elements of each type which fit in a cache line, optimizations possible considering the different semantics for integer vs. floating point math -- these effects will dominate. The data dependencies of your algorithm play a significant role here, so that no general comparison will predict the performance gap on your problem.

For example, integer addition is commutative, so if the compiler sees a loop like the one you used for a benchmark (assuming the random data was prepared in advance so it wouldn't obscure the results), it can unroll the loop and calculate partial sums with no dependencies, then add them when the loop terminates. But with floating point, the compiler has to do the operations in the same order you requested (you've got sequence points in there, so the compiler has to guarantee the same result, which disallows reordering), so there's a strong dependency of each addition on the result of the previous one.
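
For illustration, here is a sketch of the transformation such a compiler can legally apply to the integer version, written out by hand as four independent partial sums over a pre-generated data array (the function name and signature are made up for the example):

#include <cstddef>

// Four independent accumulators let the additions overlap in the pipeline.
// A compiler may do this automatically for integers, but not for floats
// unless it is told the reordering is acceptable (e.g. with -ffast-math).
int sum_array(const int* data, size_t n) {
    int sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    size_t i = 0;
    for (; i + 3 < n; i += 4) {
        sum0 += data[i];
        sum1 += data[i + 1];
        sum2 += data[i + 2];
        sum3 += data[i + 3];
    }
    for (; i < n; ++i)   // leftover elements when n is not a multiple of 4
        sum0 += data[i];
    return sum0 + sum1 + sum2 + sum3;
}

With -ffast-math (GCC) or /fp:fast (Visual C++) the compiler is allowed to make the same transformation for the float version, which is one reason such benchmarks are so sensitive to compiler flags.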

You're likely to fit more integer operands in cache at a time as well. So the fixed-point version might outperform the float version by an order of magnitude even on a machine where the FPU has theoretically higher throughput.

Ben Voigt
Great information too. Thanks.
maxpenguin
+1 for pointing out how naive benchmarks can yield 0-time loops because of unrolled constant integer operations. Moreover, the compiler can completely discard the loop (integer or FP) if the result is not actually used.
vladr
+5  A: 

For example (lower numbers are faster):

64-bit Intel Xeon X5550 @ 2.67GHz, gcc 4.1.2 -O3

short add/sub: 1.005460 [0]
short mul/div: 3.926543 [0]
long add/sub: 0.000000 [0]
long mul/div: 7.378581 [0]
long long add/sub: 0.000000 [0]
long long mul/div: 7.378593 [0]
float add/sub: 0.993583 [0]
float mul/div: 1.821565 [0]
double add/sub: 0.993884 [0]
double mul/div: 1.988664 [0]

32-bit Dual Core AMD Opteron(tm) Processor 265 @ 1.81GHz, gcc 3.4.6 -O3

short add/sub: 0.553863 [0]
short mul/div: 12.509163 [0]
long add/sub: 0.556912 [0]
long mul/div: 12.748019 [0]
long long add/sub: 5.298999 [0]
long long mul/div: 20.461186 [0]
float add/sub: 2.688253 [0]
float mul/div: 4.683886 [0]
double add/sub: 2.700834 [0]
double mul/div: 4.646755 [0]

As Dan pointed out, even once you normalize for clock frequency (which can be misleading in itself with pipelined designs), results will vary wildly based on CPU architecture: individual ALU/FPU performance, as well as the actual number of ALUs/FPUs available per core in superscalar designs, which influences how many independent operations can execute in parallel. (The latter factor is not exercised by the code below, as all of its operations are sequentially dependent.)

Poor man's FPU/ALU operation benchmark:

#include <stdio.h>
#include <stdlib.h> // rand()
#ifdef _WIN32
#include <sys/timeb.h>
#else
#include <sys/time.h>
#endif
#include <time.h>

double
mygettime(void) {
# ifdef _WIN32
  struct _timeb tb;
  _ftime(&tb);
  return (double)tb.time + (0.001 * (double)tb.millitm);
# else
  struct timeval tv;
  if(gettimeofday(&tv, 0) < 0) {
    perror("oops");
  }
  return (double)tv.tv_sec + (0.000001 * (double)tv.tv_usec);
# endif
}

template< typename Type >
void my_test(const char* name) {
  Type v  = 0;
  // Do not use constants or repeating values
  //  to avoid loop unroll optimizations.
  // All values >0 to avoid division by 0
  // Perform ten ops/iteration to reduce
  //  impact of ++i below on measurements
  Type v0 = (Type)(rand() % 256)/16 + 1;
  Type v1 = (Type)(rand() % 256)/16 + 1;
  Type v2 = (Type)(rand() % 256)/16 + 1;
  Type v3 = (Type)(rand() % 256)/16 + 1;
  Type v4 = (Type)(rand() % 256)/16 + 1;
  Type v5 = (Type)(rand() % 256)/16 + 1;
  Type v6 = (Type)(rand() % 256)/16 + 1;
  Type v7 = (Type)(rand() % 256)/16 + 1;
  Type v8 = (Type)(rand() % 256)/16 + 1;
  Type v9 = (Type)(rand() % 256)/16 + 1;

  double t1 = mygettime();
  for (size_t i = 0; i < 100000000; ++i) {
    v += v0;
    v -= v1;
    v += v2;
    v -= v3;
    v += v4;
    v -= v5;
    v += v6;
    v -= v7;
    v += v8;
    v -= v9;
  }
  // Pretend we make use of v so compiler doesn't optimize out
  //  the loop completely
  printf("%s add/sub: %f [%d]\n", name, mygettime() - t1, (int)v&1);
  t1 = mygettime();
  for (size_t i = 0; i < 100000000; ++i) {
    v /= v0;
    v *= v1;
    v /= v2;
    v *= v3;
    v /= v4;
    v *= v5;
    v /= v6;
    v *= v7;
    v /= v8;
    v *= v9;
  }
  // Pretend we make use of v so compiler doesn't optimize out
  //  the loop completely
  printf("%s mul/div: %f [%d]\n", name, mygettime() - t1, (int)v&1);
}

int main() {
  my_test< short >("short");
  my_test< long >("long");
  my_test< long long >("long long");
  my_test< float >("float");
  my_test< double >("double");

  return 0;
}
vladr
+2  A: 

Two points to consider -

Modern hardware can overlap instructions, execute them in parallel, and reorder them to make the best use of the hardware. Also, any significant floating point program is likely to have significant integer work too, even if it's only calculating indices into arrays, loop counters, etc., so even if you have a slow floating point instruction it may well be running on a separate bit of hardware, overlapped with some of the integer work. My point is that even if the floating point instructions are slower than the integer ones, your overall program may run faster because it can make use of more of the hardware.

As always, the only way to be sure is to profile your actual program.

The second point is that most CPUs these days have SIMD instructions for floating point that can operate on multiple floating point values at the same time. For example, you can load 4 floats into a single SSE register and then perform 4 multiplications on them all in parallel. If you can rewrite parts of your code to use SSE instructions, then it seems likely it will be faster than an integer version. Visual C++ provides compiler intrinsic functions to do this; see http://msdn.microsoft.com/en-us/library/x5c07e2a(v=VS.80).aspx for some information.
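
To give a flavour, here is a minimal sketch using the SSE intrinsics from <xmmintrin.h> (it assumes SSE support and an element count that is a multiple of 4; the function name and signature are made up for the example):

#include <xmmintrin.h>   // SSE intrinsics for 4-wide single-precision floats

// Multiply two float arrays element-wise, four elements per instruction.
// Unaligned loads/stores are used so the arrays need no special alignment.
void multiply_arrays(const float* a, const float* b, float* out, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vr = _mm_mul_ps(va, vb);   // 4 multiplications in parallel
        _mm_storeu_ps(out + i, vr);
    }
}

GCC and other compilers expose the same intrinsics through <xmmintrin.h>, so the sketch is not specific to Visual C++.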

John Burton