Hi there,

I want to speed up an array multiplication in C99.

These are the original for loops:

for (int i = 0; i < n; i++) {
    for (int j = 0; j < m; j++) {
        total[j] += w[j][i] * x[i];
    }
}

My boss asked me to try this, but it did not improve the speed:

for (int i = 0; i < n; i++) {
    float value = x[i];
    for (int j = 0; j < m; j++) {
        total[j] += w[j][i] * value;
    }
}

Do you have any other ideas (apart from OpenMP, which I already use) on how I could speed up these for loops? I am compiling with:

gcc -DMNIST=1 -O3 -fno-strict-aliasing -std=c99 -lm -D_GNU_SOURCE -Wall -pedantic -fopenmp

Thanks!

+1  A: 

Right now, any two consecutive inner operations (i.e. total[j] += w[j][i] * x[i]) write to different locations and read from distant locations. You can probably gain some performance by localizing the reads and writes (and thus hitting the cache more often), for example by swapping the j loop and the i loop, so that the j loop is the outer one and the i loop is the inner one.

This way you'll be localizing both the reads and the writes:
- Memory writes will go to the same place for all values of i.
- Memory reads will be sequential for w[j][i] and x[i].

To sum up:

for (int j = 0; j < m; j++) {
    for (int i = 0; i < n; i++) {
        total[j] += w[j][i] * x[i];
    }
}
adamk
+2  A: 

One theory is that testing against zero is faster than testing j < m, so by looping down from j = m while j > 0 you could, in theory, save a few nanoseconds per iteration. However, in my recent experience this has not made a single difference, so I don't think it holds for current CPUs.
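For reference, a minimal sketch of that counting-down variant, applied to the inner loop from the question (nothing else changed):

for (int i = 0; i < n; i++) {
    float value = x[i];
    /* count down so the termination test is a comparison against zero;
       j runs from m-1 down to 0 */
    for (int j = m; j-- > 0; ) {
        total[j] += w[j][i] * value;
    }
}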

Another issue is memory layout: if your inner loop accesses a chunk of memory that is contiguous rather than spread out, chances are you will get more benefit from the lowest-level cache in your CPU.

In your current example, switching the layout of w from w[j][i] to w[i][j] may therefore help; see the sketch below. Aligning your values on 4- or 8-byte boundaries will help as well (but you will find that this is already the case for your arrays).
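A minimal sketch of that layout change, assuming a hypothetical array wt that holds w transposed:

/* wt is a transposed copy of w, so wt[i][j] == w[j][i].
   For each fixed i, the inner loop now reads wt[i][0..m-1] and
   writes total[0..m-1] sequentially instead of striding through w. */
for (int i = 0; i < n; i++) {
    float value = x[i];
    for (int j = 0; j < m; j++) {
        total[j] += wt[i][j] * value;
    }
}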

Another technique is loop unrolling, meaning that you process your inner loop in chunks of, say, 4, so the test for whether the loop is done runs four times less often (see the sketch below). The optimum chunk size must be determined empirically, and may also depend on the problem at hand (e.g. if you know you're looping a multiple of 5 times, use 5).
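A minimal sketch of unrolling by 4, applied to the loop order from the first answer, with a cleanup loop for the leftover iterations:

for (int j = 0; j < m; j++) {
    int i = 0;
    /* unrolled by 4: the loop condition is evaluated once per
       four multiply-adds */
    for (; i + 4 <= n; i += 4) {
        total[j] += w[j][i]     * x[i];
        total[j] += w[j][i + 1] * x[i + 1];
        total[j] += w[j][i + 2] * x[i + 2];
        total[j] += w[j][i + 3] * x[i + 3];
    }
    /* cleanup for the remaining n % 4 iterations */
    for (; i < n; i++)
        total[j] += w[j][i] * x[i];
}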

mvds
+1 for memory locality. ---- Testing for 0: this helped reliably on older x86 CPUs, where `DEC ECX / JECXZ` took fewer cycles than `INC ECX / CMP ECX, [end] / JL`. Due to instruction pairing, general improvements to conditional jumps, and the focus on the core RISC-like instruction set, this is no longer the case. Today, at best, you free a (pseudo-)register or avoid an L1 cache read, things that *might* make a difference under particular circumstances.
peterchen
Thanks for the tip about loop unrolling. I will try it.
Framester
A: 

If this really matters:

  1. Link against a tuned CBLAS library. There are lots to choose from, some free and some commercial. Some platforms already have one on the system.
  2. Replace your loops with a call to cblas_sgemv (or cblas_dgemv if your data is double precision); a sketch follows below.

This is an extraordinarily well understood problem, and many smart people have written highly-tuned libraries for it. Use one of them.
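For illustration, a sketch of the sgemv call for the loops in the question, assuming w is a contiguous m-by-n row-major array of floats (if it is an array of row pointers, the matrix data must be packed into one block first); link against your platform's CBLAS library:

#include <cblas.h>

/* Computes total = 1.0f * w * x + 1.0f * total, which is exactly
   total[j] += sum over i of w[j][i] * x[i]. */
cblas_sgemv(CblasRowMajor, CblasNoTrans,
            m, n,        /* w has m rows and n columns             */
            1.0f,        /* alpha                                  */
            &w[0][0], n, /* matrix data and its leading dimension  */
            x, 1,        /* input vector, stride 1                 */
            1.0f,        /* beta = 1 keeps the += accumulation     */
            total, 1);   /* output vector, stride 1                */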

Stephen Canon
A: 

If you know that x, total, and w do not alias each other, you can get a fairly measurable boost by rearranging the loop indices and avoiding the write to total[j] each time through the inner loop:

for (int j = 0; j < m; j++) {
    const float * const w_j = w[j]; /* hoist the row pointer out of the inner loop */
    float total_j = 0;              /* accumulate in a local instead of memory */
    for (int i = 0; i < n; i++)
        total_j += w_j[i] * x[i];
    total[j] += total_j;            /* one write to total per row */
}
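If this loop lives in its own function, C99's restrict qualifier is one way to tell the compiler the arrays do not alias; a sketch, with a hypothetical function name and assumed parameter types:

/* restrict promises the compiler that total, w, and x do not overlap,
   so it can keep total_j in a register without reloading from memory. */
void mat_vec_accumulate(int m, int n, float * restrict total,
                        const float * const * restrict w,
                        const float * restrict x)
{
    for (int j = 0; j < m; j++) {
        const float * const w_j = w[j];
        float total_j = 0;
        for (int i = 0; i < n; i++)
            total_j += w_j[i] * x[i];
        total[j] += total_j;
    }
}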

However, BLAS is the right answer most of the time for this sort of thing. The best solution will depend on n, m, prefetch times, pipeline depths, loop unrolling, the size of your cache lines, etc. You probably don't want to redo the level of optimization that other people have already done under the covers.

Hudson