ansaurus

Question

Optimizing for speed - 4 dimensional array lookup in C

Answer 1

+5 A:

Most of your time probably goes into cache misses. If you can optimize those away, you can get a big performance boost.

Tuomas Pelkonen 2010-03-13 21:01:42

This was my thought exactly. As the lookup array is 26^4 it won't fit in L1/L2 cache.

Tiago 2010-03-13 22:34:20

If there are a lot zeros in the lookup_array, you could try to use a tree instead of an array. I guess 26 comes from characters a-z?

Tuomas Pelkonen 2010-03-13 22:55:22

Yes, it's the logarithm of frequencies of sample english. And most entries are 0 indeed.

Tiago 2010-03-13 23:03:09

Answer 2

A:

if lookup_array is mostly zeroes, could def be replaced with a hash table lookup on a smaller array. The inline lookup function could calculate the offset of the 4-dimensions ([5,6,7,8] = (4*26*26*26)+(5*26*26)+(6*26)+7 = 73847). the hash key could just be the lower few bits of the offset (depending on how sparse the array is expected to be). if the offset exists in the hash table, use the value, if it doesn't exist it's 0...

the loop could also be unrolled using something like this if the input has arbitrary length. there are only len accesses of input needed (instead of around len * 4 in the original loop).

register int j, x1, x2, x3, x4;
register unsigned int *p;

p = input;
x1 = *p++;
x2 = *p++;
x3 = *p++;

for (j = (len - 3) / 20; j--; ) {
  x4 = *p++;
  score += lookup_array[x1][x2][x3][x4];
  x1 = *p++;
  score += lookup_array[x2][x3][x4][x1];
  x2 = *p++;
  score += lookup_array[x3][x4][x1][x2];
  x3 = *p++;
  score += lookup_array[x4][x1][x2][x3];

  x4 = *p++;
  score += lookup_array[x1][x2][x3][x4];
  x1 = *p++;
  score += lookup_array[x2][x3][x4][x1];
  x2 = *p++;
  score += lookup_array[x3][x4][x1][x2];
  x3 = *p++;
  score += lookup_array[x4][x1][x2][x3];

  x4 = *p++;
  score += lookup_array[x1][x2][x3][x4];
  x1 = *p++;
  score += lookup_array[x2][x3][x4][x1];
  x2 = *p++;
  score += lookup_array[x3][x4][x1][x2];
  x3 = *p++;
  score += lookup_array[x4][x1][x2][x3];

  x4 = *p++;
  score += lookup_array[x1][x2][x3][x4];
  x1 = *p++;
  score += lookup_array[x2][x3][x4][x1];
  x2 = *p++;
  score += lookup_array[x3][x4][x1][x2];
  x3 = *p++;
  score += lookup_array[x4][x1][x2][x3];

  x4 = *p++;
  score += lookup_array[x1][x2][x3][x4];
  x1 = *p++;
  score += lookup_array[x2][x3][x4][x1];
  x2 = *p++;
  score += lookup_array[x3][x4][x1][x2];
  x3 = *p++;
  score += lookup_array[x4][x1][x2][x3];

  /* that's 20 iterations, add more if you like */
}

for (j = (len - 3) % 20; j--; ) {
  x4 = *p++;
  score += lookup_array[x1][x2][x3][x4];
  x1 = x2;
  x2 = x3;
  x3 = x4;
}

2010-03-13 21:03:44

Answer 3

+2 A:

Remember that C/C++ arrays are stored in row-major order. Remember to store your data so that addresses referenced closely in time reside closely in memory. For example, it may make sense to store sub-results in a temporary array. Then you could process exactly one row of elements located sequentially. That way the processor cache will always contain the row during iterations and less memory operations will be required. However, you might need to modularize your lookup_array function. Maybe even split it into four (by the number of dimensions in your array).

Kerido 2010-03-13 21:06:31

Answer 4

+8 A:

What is the range of the array items? If you can change the array base type to unsigned short or unsigned char, you might get fewer cache misses because a larger portion of the array fits into the cache.

nikie 2010-03-13 21:09:16

I will try that. If it were on the local variables that wouldn't work because there would be more conversions involved (e.g. left shifting the register by 16 bits on a short) which would be significant with the number of iterations. As the lookup array is global that's worth a try.

Tiago 2010-03-13 22:41:17

I'm scoring as the accepted answer as it was the simplest change that yielded the greatest performance improvement. Converting the lookup array from unsigned int to a char improved que function runtime by 20-22%. I guess as the size of the array was reduced by a factor of 4 it is having fewer cache misses as proposed.

Tiago 2010-03-14 09:14:46

Answer 5

+2 A:

The problem is definitely related to the size of the matrix. You cannot optimize it by declaring as a single array just because it's what the compiler does automatically.

Everything depends on which order do you use for accessing the data, namely on the content of the input array.

The only think you can do is work on locality: read this one, it should give you some inspiration.

By the way, I suggest you to replace the input array with four parameters: it will be more intuitive and it will be less error prone.

Good luck

Dacav 2010-03-13 21:09:23

Thanks, I will read your reference. I agree that seems to be the way.

Tiago 2010-03-13 22:49:14

+1 for mentioning that it's a single dimension array anyway.

Matt Joiner 2010-03-14 04:38:30

Answer 6

+2 A:

A few suggestions to improve performance:

Parallelise. This is a very easy reduction to be programmed in OpenMP or MPI.
Reorder data to improve locality. Try sorting input first, for example.
Use streaming processing instructions if the compiler is not already doing it.

About reordering, it would be possible if you flatten the array and use linear coordinates instead.

Another point, compare the theoretical peak performance of your processor (integer operations) with the performance you're getting (do a quick count of the assembly generated instructions, multiply by the length of the input, etc.) and see if there's room for a significant improvement there.

fortran 2010-03-13 21:14:25

On the third point: this would lead to non-portable code. However I don't think this is a problem for OP, since he claims it's running on a IA32 cpu.

Dacav 2010-03-13 21:23:15

+1 I would also add 'vectorize'

Kerido 2010-03-13 21:23:21

Parallelising will be the way to go if I can't improve the function. I can't sort input as it is scoring the elements in their actual order (4 items at a time). In case you're wondering, it's trying to identify plaintext with log tetragraphs.

Tiago 2010-03-13 22:46:27

@Kerido the third option is vectorizing ;-)

fortran 2010-03-13 22:51:17

@Tiago Hmmm... completely true, I didn't realised that sorting isn't possible in the actual shape of the problem. I'm going to update the answer...

fortran 2010-03-13 22:54:22

You're assuming the problem is CPU-bound, right? Because if it's memory bound (i.e. the memory bus is the bottleneck), optimizing for CPU instructions just means that the CPU will spend more time waiting for the memory bus.

nikie 2010-03-14 07:57:11

@nikie Why do you think I'm suggesting to reorder data to improve locality?

fortran 2010-03-14 19:41:39

Answer 7

A:

You might be able to squeeze a bit out, by unrolling the loop in some variation of Duffs device.

EvilTeach 2010-03-13 21:54:46

Answer 8

+1 A:

I have a couple of suggestions:

unsigned int lookup_array[26][26][26][26]; /* lookup_array is a global variable */

unsigned int get_i_score(unsigned int *input, len) {
register unsigned int i, score = 0;
unsigned int *a=input;
unsigned int *b=input+1;
unsigned int *c=input+2;
unsigned int *d=input+3;

for(i = 0; i < (len - 3); i++, a++, b++, c++, d++) 
    score += lookup_array[*a][*b][*c][*d];

return(score)
}

Or try

for(i = 0; i < (len - 3); i++, a=b, b=c, c=d, d++) 
    score += lookup_array[*a][*b][*c][*d];

Also, given that there are only 26 values, why are you putting the input array in terms of unsigned ints? If it were char *input, you'd be using 1/4 as much memory and therefore using 1/4 of the memory bandwidth. Obviously the types of a through d have to match. Similarly, if the score values don't need to be unsigned ints, make the array smaller by using chars or uint16_t.

Andrew McGregor 2010-03-13 21:57:51

I like the walking variable values technique +1.

EvilTeach 2010-03-13 22:08:15

Just tried your suggestion. It is elegant but running time is very similar. Guess the compiler is outsmarting me on the optimizations. I will try an assembly dump to check if any relevant difference is noticed.

Tiago 2010-03-13 22:48:16

Answer 9

A:

Multidimesional arrays often constrain the compiler to one or more multiply operations. It may be slow on some CPUs. A common workaround is to transform the N-dimensional array into an array of pointers to elements of (N-1) dimensions. With a 4-dim. array is quite annoying (26 pointers to 26*26 pointers to 26*26*26 rows...) I suggest to try it and compare the result. It is not guaranteed that it's faster: compilers are quite smart in optimizing array accesses, while a chain of indirect accesses has higher probability to invalidate the cache.

Bye

Giuseppe Guerrini 2010-03-14 00:38:14

Answer 10

A:

If you convert it to a flat array of size 26*26*26*26, you only need to lookup the input array once per loop:

unsigned int get_i_score(unsigned int *input)
{
    unsigned int i = len - 3, score = 0, index;

    index = input[i] * 26 * 26 +
            input[i + 1] * 26 +
            input[i + 2];

    while (--i)
    {
        index += input[i] * 26 * 26 * 26;
        score += lookup_array[index];
        index /= 26 ;
    }

    return score;
}

The additional cost is a multiplication and a division. Whether it ends up being faster in practice - you'll have to test.

(By the way, the register keyword is often ignored by modern compilers - it's usually better to leave register allocation up to the optimiser).

caf 2010-03-14 10:33:58

Answer 11

A:

Does the content of the array change much? Perhaps it would be faster to pre-calculate the score, and then modify that pre-calculated score everytime the array changes? Similar to how you can materialize a view in SQL using triggers.

Thomas Padron-McCarthy 2010-03-14 11:02:24

Answer 12

A:

Maybe you can eliminate some accesses to the input array by using local variables.

unsigned int lookup_array[26][26][26][26]; /* lookup_array is a global variable */

unsigned int get_i_score(unsigned int *input, unsigned int len) {
    unsigned int i, score, a, b, c, d;

    score = 0;
    a = input[i + 0];
    b = input[i + 1];
    c = input[i + 2];
    d = input[i + 3];
    for (i = len - 3; i-- > 0; ) {
        d = c, c = b, b = a, a = input[i];
        score += lookup_array[a][b][c][d];
    }

    return score;
}

Moving around registers may be faster than accessing memory, although this kind of memory should remain in the innermost cache anyway.

Roland Illig 2010-03-14 13:46:06

ansaurus

tags:

views:

answers:

Optimizing for speed - 4 dimensional array lookup in C

related questions