views:

252

answers:

9

Hi folks,

How can I improve / speed up this frequent function?

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define M 10 // This is fixed
#define N 8  // This is NOT fixed

// Assumptions: 1. x, a, b and c are all arrays of 10 (M).
//              2. y and z are all matrices of 8 x 10 (N x M).
// Requirement: 1. return the value of ret;
//              2. get all elements of array c
float fnFrequentFunction(const float* x, const float* const* y, const float* const* z,
                         const float* a, const float* b, float *c, int n)
{
    register float tmp;
    register float sum;
    register float ret = 0;
    register const float* yy;
    register const float* zz;
    int i;

    for (i = 0; i < n; i++)  // M == 1, 2, 4, or 8
    {
        sum = 0;
        yy = y[i];
        zz = z[i];

        tmp = x[0] - yy[0]; sum += tmp * tmp * zz[0];
        tmp = x[1] - yy[1]; sum += tmp * tmp * zz[1];
        tmp = x[2] - yy[2]; sum += tmp * tmp * zz[2];
        tmp = x[3] - yy[3]; sum += tmp * tmp * zz[3];
        tmp = x[4] - yy[4]; sum += tmp * tmp * zz[4];
        tmp = x[5] - yy[5]; sum += tmp * tmp * zz[5];
        tmp = x[6] - yy[6]; sum += tmp * tmp * zz[6];
        tmp = x[7] - yy[7]; sum += tmp * tmp * zz[7];
        tmp = x[8] - yy[8]; sum += tmp * tmp * zz[8];
        tmp = x[9] - yy[9]; sum += tmp * tmp * zz[9];

        ret += (c[i] = log(a[i] * b[i]) + sum);
    }

    return ret;
}

// In the main function, all values are just example data.
int main()
{
    float x[M] = {0.001251f, 0.563585f, 0.193304f, 0.808741f, 0.585009f, 0.479873f, 0.350291f, 0.895962f, 0.622840f, 0.746605f};
    float* y[N];
    float* z[N];
    float a[M] = {0.870205f, 0.733879f, 0.711386f, 0.588244f, 0.484176f, 0.852962f, 0.168126f, 0.684286f, 0.072573f, 0.632160f};
    float b[M] = {0.871487f, 0.998108f, 0.798608f, 0.134831f, 0.576281f, 0.410779f, 0.402936f, 0.522935f, 0.623218f, 0.193030f};
    float c[N];

    float t1[M] = {0.864406f, 0.709006f, 0.091433f, 0.995727f, 0.227180f, 0.902585f, 0.659047f, 0.865627f, 0.846767f, 0.514359f};
    float t2[M] = {0.866817f, 0.581347f, 0.175542f, 0.620197f, 0.781823f, 0.778588f, 0.938688f, 0.721610f, 0.940214f, 0.811353f};
    int i, j;

    int n = 10000000;
    long start;

    // Initialize y, z for test example:
    for(i = 0; i < N; ++i)
    {
        y[i] = (float*)malloc(sizeof(float) * M);
        z[i] = (float*)malloc(sizeof(float) * M);

        for(j = 0; j < M; ++j)
        {
            y[i][j] = t1[j] * j;
            z[i][j] = t2[j] * j;
        }
    }


    // Speed test here:
    start = clock();
    while(--n)
        fnFrequentFunction(x, y, z, a, b, c, 8);
    printf("Time used: %ld\n", clock() - start);


    // Output the result here:
    printf("fnFrequentFunction == %f\n", fnFrequentFunction(x, y, z, a, b, c, 8));
    for(j = 0; j < N; ++j)
        printf("  c[%d] == %f\n", j, c[j]);
    printf("\n");


    // Free memory
    for(j = 0; j < N; ++j)
    {
        free(y[j]);
        free(z[j]);
    }

    return 0;
}

Any suggestions are welcome :-)

I feel terrible that I made a big mistake in my function. The above code is the new one. I'm rechecking it now to make sure that is what I need.

+12  A: 

put this outside the loop

sum = 0;

tmp = x[0] - y[0]; sum += tmp * tmp * z[0];
tmp = x[1] - y[1]; sum += tmp * tmp * z[1];
tmp = x[2] - y[2]; sum += tmp * tmp * z[2];
tmp = x[3] - y[3]; sum += tmp * tmp * z[3];
tmp = x[4] - y[4]; sum += tmp * tmp * z[4];
tmp = x[5] - y[5]; sum += tmp * tmp * z[5];
tmp = x[6] - y[6]; sum += tmp * tmp * z[6];
tmp = x[7] - y[7]; sum += tmp * tmp * z[7];
tmp = x[8] - y[8]; sum += tmp * tmp * z[8];
tmp = x[9] - y[9]; sum += tmp * tmp * z[9];
Andrey
++ Exactly. This needs to be computed once per call.
John at CashCommons
I guess I was writing it at the same time as you just a little slower.
Romain Hippeau
It could be a simplified example, though.
slacker
M is a variable, so putting that code outside the loop is kind of ...., and furthermore, I'm afraid that i won't speed up much. Anyway, it's a good idea. Thx for your reply :-)
Peter Lee
Sorry for the confusion of the example.
Peter Lee
@Peter Lee:Notice that this piece of code is completely independent of M. It simply repeats the same calculation on the same data needlessly each iteration. It should be done only once, before the loop.
slacker
@slacker, I'm sorry that I made a terrible mistake in the code. I will correct it ASAP.
Peter Lee
@Peter Lee:It's not that terrible ;). The cost of this is relatively small, compared to the `log()`, so the speedup won't be large. If at all, as it is possible that your compiler is smart enough to do this optimization for you automatically.
slacker
@slacker: I would say it IS a terrible mistake. I updated my code. (y and z are matrices instead of arrays); We are using log(), because the real data is too large.
Peter Lee
@Peter Lee:Is the `log()` necessary, then? Wouldn't switching to `double` suffice? I would recommend avoiding complex functions like `log()` if only possible, as they are simply **DAMN SLOW**.
slacker
+2  A: 
  • This function is perfectly amenable to SIMD processing. Look into your compiler documentation for the intrinsic functions that correspond to the SSE instructions.
  • You could break up the dependence chain on the sum variable. Instead of a single sum accumulator, use two accumulators sum1 and sum2 alternately - one for even, one for odd indices. Add them up afterwards.
  • The single biggest performance bottleneck here is the log() function. Check if an approximation would be sufficient. The calculation of this could also be vectorized - I believe Intel published a high-performance math library - including vectorized versions of functions like log(). You may like to use this.
  • You are operating on floats here, and log() uses double precision. Use logf() instead. It may (or may not) be faster. It will certainly be no slower.
  • If your compiler understands C99, place a restrict qualifier on the pointers which are function arguments. This tells the compiler that those arrays do not overlap, and may help it generate more efficient code.
  • Change the way matrices are kept in memory. Instead of an array of pointers pointing to disjoint memory blocks, use a single array M*N elements in size.

So, to put it together, this is how the function should look like. This is portable C99. Using the compiler-specific SIMD intrinsics, this could be made WAAAAY faster.

UPDATE: Note that I changed the way input matrices are defined. A matrix is a single, large array.

float fnFrequentFunction(const float *restrict x, const float *restrict y,
                         const float *restrict z, const float *restrict a,
                         const float *restrict b, float *restrict c, int n)
{
    float ret = 0;
    const float *restrict yy = y; //for readability
    const float *restrict zz = z; // -||-

    for (int i = 0; i < n; i++, yy += M, zz += M)  // n == 1, 2, 4, or 8
    {
        float sum = 0;
        float sum2 = 0;

        for(int j = 0; j < 10; j += 2)
        {
            float tmp  = x[j]   - yy[j];   sum  += tmp  * tmp  * zz[j];
            float tmp2 = x[j+1] - yy[j+1]; sum2 += tmp2 * tmp2 * zz[j+1];
        }
        sum += sum2;

        ret += (c[i] = logf(a[i] * b[i]) + sum);
    }
    return ret;
}
slacker
I will look at the profiling result. Yes, the log() function is a problem compared to other lines.
Peter Lee
I will try it out, and let you know the result.
Peter Lee
+1  A: 

Use memoization to cache the results. This is a time/space trade-off optimization.

It's really easy to do this in Perl with the memoize package, and probably in many other dynamic languages. In C, you'd need to roll your own.

Use a wrapper function to make a hash of the arguments and use it to check if the value has already been calculated. If it has, return it. If not, pass through to the original function and cache the returned result.

Alternatively, you could pre-calculate your lookup table at program startup, or even calculate it once and then persist it, depending on your needs.

Duncan
Can you make it more clear?
Peter Lee
@Peter Lee: Expanded. Hope that helps more :)
Duncan
If the function is called very frequently and with differing values (as would likely be the case with floating-point inputs), this will do nothing but waste time and memory.
San Jacinto
@San Jacinto: Sure. Though the OP asked for *any* suggestions, and I did say depending on his needs. Parts of the algorithm may be memoizable even if the whole thing is not.
Duncan
A: 

The above suggestions of strength reducing the tmp values out of the loop are correct. I might even consider dropping those 10 lines into a for loop of their own as this may improve code cache efficiency.

Beyond this, you start getting to the point where you want to know what type of processor you are targetting. If it has native SIMD support, an FPU, what kind of cache it uses, etc. Also depending on how many arguments get passed via registers, reducing the parameters by combining into a single struct and pass by reference might get you a small boost. Declaring vars as register may or may not help. Again profiling and examining the assembler output will answer that.

As sum is known at before the loop, you could get away with adding M * its value in after the loop for a boost. That just leaves the 2 log muls on the inside.

If M is always 8 or has some other known pattern, you could do some minor loop unrolling, but the gains there are almost nil against the log calls.

The only other major thing to look at is log(). How is this implemented? Can you perhaps roll your own faster version through table lookups if your input range is known. Better yet, table the log products if there's enough RAM available.

Just a few thoughts.

Michael Dorgan
A: 

Do you use compiler optimization?

Register before variables is antiqued for modern compilers. You can even harm the compiler performance if you use them with compiler optimization. For example gcc simple compilation provides:

Time used: 8720000

and without register floats:

Time used: 8710000

I know this is not much.

I assume you made all those sums to avoid a for loop because you think that is much slower. It is not. A modern compiler will do that optimization for you too.

One big optimization I think is to use a table for log, if you don't mind the memory, that will be faster, use log only when you are out of range.

vladv
A: 

I wonder if doing it as scaled ints rather than floats might speed it up. I dont know the data ranges so I dont know if this is even possible

pm100
A: 

In addition to Andrey's answer, you can add some prefetching to the loop:

float fnFrequentFunction(const float* x, const float* y, const float* z,
                         const float *a, const float *b, float *c, int M)
{
    register float tmp;
    register float sum;
    register float ret = 0;
    int i;
    sum = 0;

    tmp = x[0] - y[0]; sum += tmp * tmp * z[0];
    tmp = x[1] - y[1]; sum += tmp * tmp * z[1];
    tmp = x[2] - y[2]; sum += tmp * tmp * z[2];
    tmp = x[3] - y[3]; sum += tmp * tmp * z[3];
    tmp = x[4] - y[4]; sum += tmp * tmp * z[4];
    tmp = x[5] - y[5]; sum += tmp * tmp * z[5];
    tmp = x[6] - y[6]; sum += tmp * tmp * z[6];
    tmp = x[7] - y[7]; sum += tmp * tmp * z[7];
    tmp = x[8] - y[8]; sum += tmp * tmp * z[8];
    tmp = x[9] - y[9]; sum += tmp * tmp * z[9];

    for (i = 0; i < M; i++)  // M == 1, 2, 4, or 8
    {
        //----------------------------------------
        // Prefetch data into the processor's cache
        //----------------------------------------
        float a_value = a[i];
        float b_value = b[i];
        float c_value = 0.0;

        //----------------------------------------
        // Calculate using prefetched data.
        //----------------------------------------
        c_value = log(a_value * b_value) + sum;
        c[i] = c_value;
        ret += c_value;
    }

    return ret;
}

You could also try unrolling the loop:

float a_value = 0.0;
float b_value = 0.0;
float c_value = 0.0;
--M;
switch (M)
{
    case 7:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
    ret += c_value;
    --M;
    case 6:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
    ret += c_value;
    --M;
    case 5:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
    ret += c_value;
    --M;
    case 4:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
    ret += c_value;
    --M;
    case 3:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
    ret += c_value;
    --M;
    case 2:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
    ret += c_value;
    --M;
    case 1:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
    ret += c_value;
    --M;
    case 0:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
    ret += c_value;
    break;
}

Looking at the unrolled version, you could take the " + sum" out of the "loop" and add it in at the end as: ret += (M + 1) * sum; since sum doesn't change.

Finally, another alternative is to perform all multiplications at once, followed by all log calculations, then sum up everything:

float product[8];
for (i = 0; i < M; ++i)
{
  product[i] = a[i] * b[i];
}
for (i = 0; i < M; ++i)
{
  c[i] = log(product);
  ret += c[i];
}
ret += M * sum;
Thomas Matthews
`sum` is added to the values stored in `c`, so it can't be removed from the loop.
outis
I feel terrible that I made a mistake in the previous function. Please see the updated one. Sorry, folks.Yes. (sum) cannot be removed from the loop, because the array c needs that.
Peter Lee
Unfortunately, this is all WRONG. The data is small enough to fit in the L1 cache, so there is no need for prefetching. Unrolling is useless here, as this is FP-heavy code, with integer ALUs slacking the whole time. Loop overhead is essentially FREE here. Unrolling can only hurt performance, by wasting L1 code cache space.
slacker
A: 

If you are calling this multiple times when a and b have not changed, then combine a and b into logab where logab[i] = log(a[i] * b[i]) since a and b are not used anywhere else.

drawnonward
A: 

This appears to be a gaussian mixture model computation. Several years ago, I worked on an effort to optimize this same algorithm which was being used as part of a speech processing program. I investigated a number of optimizations like you're attempting to do but never found anything using straight C to gain more than just a few percent. My biggest gain came from recoding the basic GMM kernel using SIMD instructions. Since that still wasn't providing the performance I was looking for, the next (and final) step was to use an Nvidia GPU. This sort of worked but programming that thing was a headache in itself.

Sorry I can't be more helpful but I don't think you are going to pick up more than just a nominal amount of speed if you're sticking to a regular CPU.

sizzzzlerz