ansaurus

Question

How Can I Improve/SpeedUp This FrequentFunction in C?

Answer 1

+12 A:

put this outside the loop

sum = 0;

tmp = x[0] - y[0]; sum += tmp * tmp * z[0];
tmp = x[1] - y[1]; sum += tmp * tmp * z[1];
tmp = x[2] - y[2]; sum += tmp * tmp * z[2];
tmp = x[3] - y[3]; sum += tmp * tmp * z[3];
tmp = x[4] - y[4]; sum += tmp * tmp * z[4];
tmp = x[5] - y[5]; sum += tmp * tmp * z[5];
tmp = x[6] - y[6]; sum += tmp * tmp * z[6];
tmp = x[7] - y[7]; sum += tmp * tmp * z[7];
tmp = x[8] - y[8]; sum += tmp * tmp * z[8];
tmp = x[9] - y[9]; sum += tmp * tmp * z[9];

Andrey 2010-04-19 22:20:23

++ Exactly. This needs to be computed once per call.

John at CashCommons 2010-04-19 22:26:55

I guess I was writing it at the same time as you just a little slower.

Romain Hippeau 2010-04-19 22:32:07

It could be a simplified example, though.

slacker 2010-04-19 22:32:23

M is a variable, so putting that code outside the loop is kind of ...., and furthermore, I'm afraid that i won't speed up much. Anyway, it's a good idea. Thx for your reply :-)

Peter Lee 2010-04-19 22:37:24

Sorry for the confusion of the example.

Peter Lee 2010-04-19 22:38:55

@Peter Lee:Notice that this piece of code is completely independent of M. It simply repeats the same calculation on the same data needlessly each iteration. It should be done only once, before the loop.

slacker 2010-04-19 22:44:48

@slacker, I'm sorry that I made a terrible mistake in the code. I will correct it ASAP.

Peter Lee 2010-04-19 22:58:30

@Peter Lee:It's not that terrible ;). The cost of this is relatively small, compared to the `log()`, so the speedup won't be large. If at all, as it is possible that your compiler is smart enough to do this optimization for you automatically.

slacker 2010-04-19 23:08:04

@slacker: I would say it IS a terrible mistake. I updated my code. (y and z are matrices instead of arrays); We are using log(), because the real data is too large.

Peter Lee 2010-04-19 23:50:33

@Peter Lee:Is the `log()` necessary, then? Wouldn't switching to `double` suffice? I would recommend avoiding complex functions like `log()` if only possible, as they are simply **DAMN SLOW**.

slacker 2010-04-20 00:10:33

Answer 2

+2 A:

This function is perfectly amenable to SIMD processing. Look into your compiler documentation for the intrinsic functions that correspond to the SSE instructions.
You could break up the dependence chain on the sum variable. Instead of a single sum accumulator, use two accumulators sum1 and sum2 alternately - one for even, one for odd indices. Add them up afterwards.
The single biggest performance bottleneck here is the log() function. Check if an approximation would be sufficient. The calculation of this could also be vectorized - I believe Intel published a high-performance math library - including vectorized versions of functions like log(). You may like to use this.
You are operating on floats here, and log() uses double precision. Use logf() instead. It may (or may not) be faster. It will certainly be no slower.
If your compiler understands C99, place a restrict qualifier on the pointers which are function arguments. This tells the compiler that those arrays do not overlap, and may help it generate more efficient code.
Change the way matrices are kept in memory. Instead of an array of pointers pointing to disjoint memory blocks, use a single array M*N elements in size.

So, to put it together, this is how the function should look like. This is portable C99. Using the compiler-specific SIMD intrinsics, this could be made WAAAAY faster.

UPDATE: Note that I changed the way input matrices are defined. A matrix is a single, large array.

float fnFrequentFunction(const float *restrict x, const float *restrict y,
                         const float *restrict z, const float *restrict a,
                         const float *restrict b, float *restrict c, int n)
{
    float ret = 0;
    const float *restrict yy = y; //for readability
    const float *restrict zz = z; // -||-

    for (int i = 0; i < n; i++, yy += M, zz += M)  // n == 1, 2, 4, or 8
    {
        float sum = 0;
        float sum2 = 0;

        for(int j = 0; j < 10; j += 2)
        {
            float tmp  = x[j]   - yy[j];   sum  += tmp  * tmp  * zz[j];
            float tmp2 = x[j+1] - yy[j+1]; sum2 += tmp2 * tmp2 * zz[j+1];
        }
        sum += sum2;

        ret += (c[i] = logf(a[i] * b[i]) + sum);
    }
    return ret;
}

slacker 2010-04-19 22:24:10

I will look at the profiling result. Yes, the log() function is a problem compared to other lines.

Peter Lee 2010-04-19 23:53:24

I will try it out, and let you know the result.

Peter Lee 2010-04-20 00:06:29

Answer 3

+1 A:

Use memoization to cache the results. This is a time/space trade-off optimization.

It's really easy to do this in Perl with the memoize package, and probably in many other dynamic languages. In C, you'd need to roll your own.

Use a wrapper function to make a hash of the arguments and use it to check if the value has already been calculated. If it has, return it. If not, pass through to the original function and cache the returned result.

Alternatively, you could pre-calculate your lookup table at program startup, or even calculate it once and then persist it, depending on your needs.

Duncan 2010-04-19 22:37:01

Can you make it more clear?

Peter Lee 2010-04-19 22:41:09

@Peter Lee: Expanded. Hope that helps more :)

Duncan 2010-04-20 00:03:55

If the function is called very frequently and with differing values (as would likely be the case with floating-point inputs), this will do nothing but waste time and memory.

San Jacinto 2010-04-20 01:58:18

@San Jacinto: Sure. Though the OP asked for *any* suggestions, and I did say depending on his needs. Parts of the algorithm may be memoizable even if the whole thing is not.

Duncan 2010-04-20 02:41:03

Answer 4

A:

The above suggestions of strength reducing the tmp values out of the loop are correct. I might even consider dropping those 10 lines into a for loop of their own as this may improve code cache efficiency.

Beyond this, you start getting to the point where you want to know what type of processor you are targetting. If it has native SIMD support, an FPU, what kind of cache it uses, etc. Also depending on how many arguments get passed via registers, reducing the parameters by combining into a single struct and pass by reference might get you a small boost. Declaring vars as register may or may not help. Again profiling and examining the assembler output will answer that.

As sum is known at before the loop, you could get away with adding M * its value in after the loop for a boost. That just leaves the 2 log muls on the inside.

If M is always 8 or has some other known pattern, you could do some minor loop unrolling, but the gains there are almost nil against the log calls.

The only other major thing to look at is log(). How is this implemented? Can you perhaps roll your own faster version through table lookups if your input range is known. Better yet, table the log products if there's enough RAM available.

Just a few thoughts.

Michael Dorgan 2010-04-19 22:42:46

Answer 5

A:

Do you use compiler optimization?

Register before variables is antiqued for modern compilers. You can even harm the compiler performance if you use them with compiler optimization. For example gcc simple compilation provides:

Time used: 8720000

and without register floats:

Time used: 8710000

I know this is not much.

I assume you made all those sums to avoid a for loop because you think that is much slower. It is not. A modern compiler will do that optimization for you too.

One big optimization I think is to use a table for log, if you don't mind the memory, that will be faster, use log only when you are out of range.

vladv 2010-04-19 23:07:48

Answer 6

A:

I wonder if doing it as scaled ints rather than floats might speed it up. I dont know the data ranges so I dont know if this is even possible

pm100 2010-04-19 23:21:10

Answer 7

A:

In addition to Andrey's answer, you can add some prefetching to the loop:

float fnFrequentFunction(const float* x, const float* y, const float* z,
                         const float *a, const float *b, float *c, int M)
{
    register float tmp;
    register float sum;
    register float ret = 0;
    int i;
    sum = 0;

    tmp = x[0] - y[0]; sum += tmp * tmp * z[0];
    tmp = x[1] - y[1]; sum += tmp * tmp * z[1];
    tmp = x[2] - y[2]; sum += tmp * tmp * z[2];
    tmp = x[3] - y[3]; sum += tmp * tmp * z[3];
    tmp = x[4] - y[4]; sum += tmp * tmp * z[4];
    tmp = x[5] - y[5]; sum += tmp * tmp * z[5];
    tmp = x[6] - y[6]; sum += tmp * tmp * z[6];
    tmp = x[7] - y[7]; sum += tmp * tmp * z[7];
    tmp = x[8] - y[8]; sum += tmp * tmp * z[8];
    tmp = x[9] - y[9]; sum += tmp * tmp * z[9];

    for (i = 0; i < M; i++)  // M == 1, 2, 4, or 8
    {
        //----------------------------------------
        // Prefetch data into the processor's cache
        //----------------------------------------
        float a_value = a[i];
        float b_value = b[i];
        float c_value = 0.0;

        //----------------------------------------
        // Calculate using prefetched data.
        //----------------------------------------
        c_value = log(a_value * b_value) + sum;
        c[i] = c_value;
        ret += c_value;
    }

    return ret;
}

You could also try unrolling the loop:

float a_value = 0.0;
float b_value = 0.0;
float c_value = 0.0;
--M;
switch (M)
{
    case 7:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
    ret += c_value;
    --M;
    case 6:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
    ret += c_value;
    --M;
    case 5:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
    ret += c_value;
    --M;
    case 4:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
    ret += c_value;
    --M;
    case 3:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
    ret += c_value;
    --M;
    case 2:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
    ret += c_value;
    --M;
    case 1:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
    ret += c_value;
    --M;
    case 0:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
    ret += c_value;
    break;
}

Looking at the unrolled version, you could take the " + sum" out of the "loop" and add it in at the end as: ret += (M + 1) * sum; since sum doesn't change.

Finally, another alternative is to perform all multiplications at once, followed by all log calculations, then sum up everything:

float product[8];
for (i = 0; i < M; ++i)
{
  product[i] = a[i] * b[i];
}
for (i = 0; i < M; ++i)
{
  c[i] = log(product);
  ret += c[i];
}
ret += M * sum;

Thomas Matthews 2010-04-19 23:49:49

`sum` is added to the values stored in `c`, so it can't be removed from the loop.

outis 2010-04-20 00:02:01

I feel terrible that I made a mistake in the previous function. Please see the updated one. Sorry, folks.Yes. (sum) cannot be removed from the loop, because the array c needs that.

Peter Lee 2010-04-20 00:11:17

Unfortunately, this is all WRONG. The data is small enough to fit in the L1 cache, so there is no need for prefetching. Unrolling is useless here, as this is FP-heavy code, with integer ALUs slacking the whole time. Loop overhead is essentially FREE here. Unrolling can only hurt performance, by wasting L1 code cache space.

slacker 2010-04-20 00:23:00

Answer 8

A:

If you are calling this multiple times when a and b have not changed, then combine a and b into logab where logab[i] = log(a[i] * b[i]) since a and b are not used anywhere else.

drawnonward 2010-04-20 00:26:56

Answer 9

A:

This appears to be a gaussian mixture model computation. Several years ago, I worked on an effort to optimize this same algorithm which was being used as part of a speech processing program. I investigated a number of optimizations like you're attempting to do but never found anything using straight C to gain more than just a few percent. My biggest gain came from recoding the basic GMM kernel using SIMD instructions. Since that still wasn't providing the performance I was looking for, the next (and final) step was to use an Nvidia GPU. This sort of worked but programming that thing was a headache in itself.

Sorry I can't be more helpful but I don't think you are going to pick up more than just a nominal amount of speed if you're sticking to a regular CPU.

sizzzzlerz 2010-04-20 01:20:51

ansaurus

tags:

views:

answers:

How Can I Improve/SpeedUp This FrequentFunction in C?

related questions