views:

462

answers:

12

I have a fitness function that is scoring the values on an int array based on data that lies on a 4D array. The profiler says this function is using 80% of CPU time (it needs to be called several million times). I can't seem to optimize it further (if it's even possible). Here is the function:

unsigned int lookup_array[26][26][26][26]; /* lookup_array is a global variable */

unsigned int get_i_score(unsigned int *input) {
register unsigned int i, score = 0;

for(i = len - 3; i--; ) 
    score += lookup_array[input[i]][input[i + 1]][input[i + 2]][input[i + 3]];

return(score)
}

I've tried to flatten the array to a single dimension but there was no improvement in performance. This is running on an IA32 CPU. Any CPU specific optimizations are also helpful. Thanks

+5  A: 

Most of your time probably goes into cache misses. If you can optimize those away, you can get a big performance boost.

Tuomas Pelkonen
This was my thought exactly. As the lookup array is 26^4 it won't fit in L1/L2 cache.
Tiago
If there are a lot zeros in the lookup_array, you could try to use a tree instead of an array. I guess 26 comes from characters a-z?
Tuomas Pelkonen
Yes, it's the logarithm of frequencies of sample english. And most entries are 0 indeed.
Tiago
A: 

if lookup_array is mostly zeroes, could def be replaced with a hash table lookup on a smaller array. The inline lookup function could calculate the offset of the 4-dimensions ([5,6,7,8] = (4*26*26*26)+(5*26*26)+(6*26)+7 = 73847). the hash key could just be the lower few bits of the offset (depending on how sparse the array is expected to be). if the offset exists in the hash table, use the value, if it doesn't exist it's 0...

the loop could also be unrolled using something like this if the input has arbitrary length. there are only len accesses of input needed (instead of around len * 4 in the original loop).

register int j, x1, x2, x3, x4;
register unsigned int *p;

p = input;
x1 = *p++;
x2 = *p++;
x3 = *p++;

for (j = (len - 3) / 20; j--; ) {
  x4 = *p++;
  score += lookup_array[x1][x2][x3][x4];
  x1 = *p++;
  score += lookup_array[x2][x3][x4][x1];
  x2 = *p++;
  score += lookup_array[x3][x4][x1][x2];
  x3 = *p++;
  score += lookup_array[x4][x1][x2][x3];

  x4 = *p++;
  score += lookup_array[x1][x2][x3][x4];
  x1 = *p++;
  score += lookup_array[x2][x3][x4][x1];
  x2 = *p++;
  score += lookup_array[x3][x4][x1][x2];
  x3 = *p++;
  score += lookup_array[x4][x1][x2][x3];

  x4 = *p++;
  score += lookup_array[x1][x2][x3][x4];
  x1 = *p++;
  score += lookup_array[x2][x3][x4][x1];
  x2 = *p++;
  score += lookup_array[x3][x4][x1][x2];
  x3 = *p++;
  score += lookup_array[x4][x1][x2][x3];

  x4 = *p++;
  score += lookup_array[x1][x2][x3][x4];
  x1 = *p++;
  score += lookup_array[x2][x3][x4][x1];
  x2 = *p++;
  score += lookup_array[x3][x4][x1][x2];
  x3 = *p++;
  score += lookup_array[x4][x1][x2][x3];

  x4 = *p++;
  score += lookup_array[x1][x2][x3][x4];
  x1 = *p++;
  score += lookup_array[x2][x3][x4][x1];
  x2 = *p++;
  score += lookup_array[x3][x4][x1][x2];
  x3 = *p++;
  score += lookup_array[x4][x1][x2][x3];

  /* that's 20 iterations, add more if you like */
}

for (j = (len - 3) % 20; j--; ) {
  x4 = *p++;
  score += lookup_array[x1][x2][x3][x4];
  x1 = x2;
  x2 = x3;
  x3 = x4;
}
+2  A: 

Remember that C/C++ arrays are stored in row-major order. Remember to store your data so that addresses referenced closely in time reside closely in memory. For example, it may make sense to store sub-results in a temporary array. Then you could process exactly one row of elements located sequentially. That way the processor cache will always contain the row during iterations and less memory operations will be required. However, you might need to modularize your lookup_array function. Maybe even split it into four (by the number of dimensions in your array).

Kerido
+8  A: 

What is the range of the array items? If you can change the array base type to unsigned short or unsigned char, you might get fewer cache misses because a larger portion of the array fits into the cache.

nikie
I will try that. If it were on the local variables that wouldn't work because there would be more conversions involved (e.g. left shifting the register by 16 bits on a short) which would be significant with the number of iterations. As the lookup array is global that's worth a try.
Tiago
I'm scoring as the accepted answer as it was the simplest change that yielded the greatest performance improvement. Converting the lookup array from unsigned int to a char improved que function runtime by 20-22%. I guess as the size of the array was reduced by a factor of 4 it is having fewer cache misses as proposed.
Tiago
+2  A: 

The problem is definitely related to the size of the matrix. You cannot optimize it by declaring as a single array just because it's what the compiler does automatically.

Everything depends on which order do you use for accessing the data, namely on the content of the input array.

The only think you can do is work on locality: read this one, it should give you some inspiration.

By the way, I suggest you to replace the input array with four parameters: it will be more intuitive and it will be less error prone.

Good luck

Dacav
Thanks, I will read your reference. I agree that seems to be the way.
Tiago
+1 for mentioning that it's a single dimension array anyway.
Matt Joiner
+2  A: 

A few suggestions to improve performance:

  • Parallelise. This is a very easy reduction to be programmed in OpenMP or MPI.
  • Reorder data to improve locality. Try sorting input first, for example.
  • Use streaming processing instructions if the compiler is not already doing it.

About reordering, it would be possible if you flatten the array and use linear coordinates instead.

Another point, compare the theoretical peak performance of your processor (integer operations) with the performance you're getting (do a quick count of the assembly generated instructions, multiply by the length of the input, etc.) and see if there's room for a significant improvement there.

fortran
On the third point: this would lead to non-portable code. However I don't think this is a problem for OP, since he claims it's running on a IA32 cpu.
Dacav
+1 I would also add 'vectorize'
Kerido
Parallelising will be the way to go if I can't improve the function. I can't sort input as it is scoring the elements in their actual order (4 items at a time). In case you're wondering, it's trying to identify plaintext with log tetragraphs.
Tiago
@Kerido the third option is vectorizing ;-)
fortran
@Tiago Hmmm... completely true, I didn't realised that sorting isn't possible in the actual shape of the problem. I'm going to update the answer...
fortran
You're assuming the problem is CPU-bound, right? Because if it's memory bound (i.e. the memory bus is the bottleneck), optimizing for CPU instructions just means that the CPU will spend more time waiting for the memory bus.
nikie
@nikie Why do you think I'm suggesting to reorder data to improve locality?
fortran
A: 

You might be able to squeeze a bit out, by unrolling the loop in some variation of Duffs device.

EvilTeach
+1  A: 

I have a couple of suggestions:

unsigned int lookup_array[26][26][26][26]; /* lookup_array is a global variable */

unsigned int get_i_score(unsigned int *input, len) {
register unsigned int i, score = 0;
unsigned int *a=input;
unsigned int *b=input+1;
unsigned int *c=input+2;
unsigned int *d=input+3;

for(i = 0; i < (len - 3); i++, a++, b++, c++, d++) 
    score += lookup_array[*a][*b][*c][*d];

return(score)
}

Or try

for(i = 0; i < (len - 3); i++, a=b, b=c, c=d, d++) 
    score += lookup_array[*a][*b][*c][*d];

Also, given that there are only 26 values, why are you putting the input array in terms of unsigned ints? If it were char *input, you'd be using 1/4 as much memory and therefore using 1/4 of the memory bandwidth. Obviously the types of a through d have to match. Similarly, if the score values don't need to be unsigned ints, make the array smaller by using chars or uint16_t.

Andrew McGregor
I like the walking variable values technique +1.
EvilTeach
Just tried your suggestion. It is elegant but running time is very similar. Guess the compiler is outsmarting me on the optimizations. I will try an assembly dump to check if any relevant difference is noticed.
Tiago
A: 

Multidimesional arrays often constrain the compiler to one or more multiply operations. It may be slow on some CPUs. A common workaround is to transform the N-dimensional array into an array of pointers to elements of (N-1) dimensions. With a 4-dim. array is quite annoying (26 pointers to 26*26 pointers to 26*26*26 rows...) I suggest to try it and compare the result. It is not guaranteed that it's faster: compilers are quite smart in optimizing array accesses, while a chain of indirect accesses has higher probability to invalidate the cache.

Bye

Giuseppe Guerrini
A: 

If you convert it to a flat array of size 26*26*26*26, you only need to lookup the input array once per loop:

unsigned int get_i_score(unsigned int *input)
{
    unsigned int i = len - 3, score = 0, index;

    index = input[i] * 26 * 26 +
            input[i + 1] * 26 +
            input[i + 2];

    while (--i)
    {
        index += input[i] * 26 * 26 * 26;
        score += lookup_array[index];
        index /= 26 ;
    }

    return score;
}

The additional cost is a multiplication and a division. Whether it ends up being faster in practice - you'll have to test.

(By the way, the register keyword is often ignored by modern compilers - it's usually better to leave register allocation up to the optimiser).

caf
A: 

Does the content of the array change much? Perhaps it would be faster to pre-calculate the score, and then modify that pre-calculated score everytime the array changes? Similar to how you can materialize a view in SQL using triggers.

Thomas Padron-McCarthy
A: 

Maybe you can eliminate some accesses to the input array by using local variables.

unsigned int lookup_array[26][26][26][26]; /* lookup_array is a global variable */

unsigned int get_i_score(unsigned int *input, unsigned int len) {
    unsigned int i, score, a, b, c, d;

    score = 0;
    a = input[i + 0];
    b = input[i + 1];
    c = input[i + 2];
    d = input[i + 3];
    for (i = len - 3; i-- > 0; ) {
        d = c, c = b, b = a, a = input[i];
        score += lookup_array[a][b][c][d];
    }

    return score;
}

Moving around registers may be faster than accessing memory, although this kind of memory should remain in the innermost cache anyway.

Roland Illig