views:

656

answers:

3

Hi all,

I wrote some code recently (ISO/ANSI C), and was surprised at the poor performance it achieved. Long story short, it turned out that the culprit was the floor() function. Not only it was slow, but it did not vectorize (with Intel compiler, aka ICL).

Here are some benchmarks for performing floor for all cells in a 2D matrix:

VC:  0.10
ICL: 0.20

Compare that to a simple cast:

VC:  0.04
ICL: 0.04

How can floor() be that much slower than a simple cast?! It does essentially the same thing (apart for negative numbers). 2nd question: Does someone know of a super-fast floor() implementation?

Thanks in advance,

Roger

PS: Here is the loop that I was benchmarking:

void Floor(float *matA, int *intA, const int height, const int width, const int width_aligned)
{
    float *rowA=NULL;
    int   *intRowA=NULL;
    int   row, col;

    for(row=0 ; row<height ; ++row){
        rowA = matA + row*width_aligned;
        intRowA = intA + row*width_aligned;
#pragma ivdep
        for(col=0 ; col<width; ++col){
            /*intRowA[col] = floor(rowA[col]);*/
            intRowA[col] = (int)(rowA[col]);
        }
    }
}
+2  A: 
  1. They do not do the same thing. floor() is a function. Therefore, using it incurs a function call, allocating a stack frame, copying of parameters and retrieving the result. Casting is not a function call, so it uses faster mechanisms (I believe that it may use registers to process the values).
  2. Probably floor() is already optimized.
  3. Can you squeeze more performance out of your algorithm? Maybe switching rows and columns may help? Can you cache common values? Are all your compiler's optimizations on? Can you switch an operating system? a compiler? Jon Bentley's Programming Pearls has a great review of possible optimizations.
Yuval F
+11  A: 

A couple of things make floor slower than a cast and prevent vectorization.

The most important one:

floor can modify the global state. If you pass a value that is to huge to be represented as an integer in float format, the errno variable gets set to EDOM. Special handling for NaNs is done as well. All this behaviour is for applications that want to detect the overflow case and handle the situation somehow (don't ask me how).

Detecting these problematic conditions is not simple and makes up more than 90% of the excution time of floor. The actual rounding is cheap and could be inlined/vectorized. Also It's a lot of code, so inlining the whole floor-function would make your program run slower.

Some compilers have special compiler flags that allow the compiler to optimize away some of the rarely used c-standard rules. For example GCC can be told that you're not interested in errno at all. To do so pass -fno-math-errno or -ffast-math. ICC and VC may have similar compiler flags.

Btw - You can roll your own floor-function using simple casts. You just have to handle the negative and positive cases differently. That may be a lot faster if you don't need the special handling of overflows and NaNs.

Nils Pipenbrinck
A: 

Yes, floor() is extremely slow on all platforms since it has to implement a lot of behaviour from the IEEE fp spec. You can't really use it in inner loops.

I sometimes use a macro to approximate floor():

#define PSEUDO_FLOOR( V ) ((V) >= 0 ? (int)(V) : (int)((V) - 1))

It does not behave exactly as floor(): for example, floor(-1) == -1 but PSEUDO_FLOOR(-1) == -2, but it's close enough for most uses.

jrgc
Naive implementation. PSEUDO_FLOOR( x++ ) would break this.
Charlie Somerville