views:

158

answers:

4

I'm dealing with a 2D array with the following characteristics:

const int cols = 500; 
const int rows = 100; 
int arr[rows][cols];

I access array arr in the following manner to do some work:

for(int k = 0; k < T; ++k) { // for each trainee
  myscore[k] = 0;
  for(int i = 0; i < cols; ++i) { // for each sample  
    for(int j = 0; j < rows; ++j) { // for each expert
      myscore[k] += delta(i, anotherArray[k][i], arr[j][i]);
    }   
  }
}

So I am worried about the array 'arr' and not the other one. I need to make this more cache-friendly and also boost the speed. I was thinking perhaps transposing the array but I wasn't sure how to do that. My implementation turns out to only work for square matrices. How would I make it work for non-square matrices?

Also, would mapping the 2D array into a 1D array boost the performance? If so, how would I do that? Finally, any other advice on how else I can optimize this... I've run out of ideas, but I know that arr[j][i] is the place where I need to make changes because I'm accessing columns by columns instead of rows by rows so that is not cache friendly at all.

Thanks, Hristo

A: 

You want memory accesses to be adjacent. In your case simply swap I and j when accessing arr.

Yann Ramin
That doesn't make sense. I'll segfault. I agree that I want to access sequential memory, but I don't think that is the way to do it.
Hristo
@hristo: swap the dimensions of the array too, so swapped (transposed) i and j are valid
Potatoswatter
+2  A: 

A general in-place matrix transposition is very difficult, but if you're okay with transposing it to another array, then it's pretty simple.

const int cols = 500; 
const int rows = 100; 

int arr[rows][cols];
// fill arr[][]

int arrT[cols][rows];
for (int r = 0; r < rows; r++) {
   for (int c = 0; c < cols; c++) {
      arrT[c][r] = arr[r][c];
   }
}

Of course, depending on how you're getting arr[][], you can just fill arrT[][] directly instead.

However, there may be a simpler solution of simple swapping the order of the loops.

for(int k = 0; k < T; ++k) { // for each trainee
  myscore[k] = 0;
  for(int j = 0; j < rows; ++j) { // for each expert
    for(int i = 0; i < cols; ++i) { // for each sample  
      myscore[k] += delta(i, anotherArray[k][i], arr[j][i]);
    }   
  }
}
polygenelubricants
I tried swapping the j and i loops, and unfortunately it runs slower than the other way around, which doesn't make sense.
Hristo
+1  A: 
  for(int i = 0; i < N; ++i) { // for each sample  
    for(int j = 0; j < E[i]; ++j) { // for each expert
      ... arr[j][i] ... // each ++j causes a large stride => poor caching
    }   
  }

transpose the loops:

  for(int j = 0; j < E[i]; ++j) { // for each expert
    for(int i = 0; i < N; ++i) { // for each sample  
      ... arr[j][i] ... // each ++i looks to the next word in memory => good
    }   
  }

Of course, without seeing everything else in the program, I can't say if that would cause a problem. If delta doesn't have side effects, you should be fine.

Potatoswatter
I did this already and it runs slower, which doesn't help. Something I've been hiding is that this runs using Intel's TBB and is parallelized for multiple cores. It should run faster b/c it is more cache friendly, but it isn't :/
Hristo
@hristo: it sounds like you completely misled us as to what the program does and what constrains its performance. Please don't hide things like that in the future.
Potatoswatter
+2  A: 

Yes, 1d should be faster than 2d. C and C++ arrays are always 1d (internally). When you call something like

array[row][col]

the compiler actually calculates

col + row * maxcols

and uses that as the actual index of a 1d array. You might as well do that yourself. Cycling through an entire array will be way faster, and random access will be equally fast as in a 2d array.

mingos
Can you please explain what you mean by "Cycling through an entire array will be way faster, and random access will be equally fast as in a 2d array." I'm not quite following you.
Hristo
Cycling through a 1d array requires no extra calculations, just increment the iterator value and use it as index. Random access wil require the "col+row*maxcols" index value to be calculated.
mingos