views: 127

answers: 5

Here is the code:

int array[X][Y] = {0,};

// 1 way to access the data
for (int x = 0; x < X; x++)
  for(int y = 0; y < Y; y++)
    array[x][y] = compute();

// the other way to access the data
for (int y = 0; y < Y; y++)
  for (int x = 0; x < X; x++)
    array[x][y] = compute();

Is it true that the first way is more efficient than the second because of CPU cache (L1, L2?) optimization? In other words, is a sequential access pattern preferred even for RAM?

A: 

A famous expression: "It probably wouldn't make a noticeable difference because of the speed of modern computers."

Alexander Rafferty
Cache misses can make a huge difference, especially in modern computers where CPU speed has far outstripped RAM speed.
Ferruccio
+2  A: 

Yep, the first one is faster. In memory, the matrix is stored one row after another (row-major order), so there is a greater chance that neighboring elements will be in the same page of virtual memory (entire pages are brought into the cache, so the access time is smaller).

The other approach will generate many more cache misses for larger matrices.

Klark
A cache does not contain entire pages, it contains cache lines which are considerably smaller than pages. A single cache line ranges from a few bytes to several hundred. A page is at least a few kilobytes. Regardless of this slip of terminology, the access pattern *is* important in high performance code. The question remains: how big are `X` and `Y`, and what are the results of actual benchmarks :).
Pieter
@Pieter, both X and Y could be very large, especially Y.
Thomson Tan
+2  A: 

Measure it.

Sequential access is preferred. The difference should depend largely on the values of X and Y; for certain choices of X and Y I would expect it to be considerable.

You should consider using a container like std::vector, std::valarray or boost::matrix. C-style arrays can lead to avoidable and annoying bugs.

Gabriel Schreiber
+3  A: 

Yes. Especially if a row fits within a cache line. If you used the second method and the rows of your array were sufficiently large, there would be no cache locality and the cache lines would constantly be thrashed.

Jeff M
+5  A: 

You'll understand this better if you draw a picture of your array in memory:

  Y ->
X xxxxx ...
| xxxxx
v xxxxx
  .
  .

The address you access grows linearly in the y direction (345, 345+1, 345+2, ...), but jumps by a whole row in the x direction (345, 345+Y, 345+Y*2, ...), which is a huge stride if Y is large. Since the cache loads blocks of memory, with Y big enough you jump out of the loaded block almost immediately, while walking in the y direction keeps you inside the cached block until it has to be refreshed.

Also note that this effect can be even more extreme with dynamic allocation. Running the following program with full optimizations gives me this output (times in seconds):

0.615000
9.878000

EDIT: Other interesting measures:

Replacing the array code with int array[X][Y]; will use stack memory, which is limited, so you can't test much bigger X/Y values, but it is also very fast:

0.000000
0.000000

Using int array[X][Y]; as a global variable uses a block of static memory, and the second access pattern is slow again. So even without dynamic allocation, the first case is much better:

0.929000
8.944000

Using X=1500, Y=1500 shows that the effect is measurable even with smaller arrays:

0.008000
0.059000

EDIT2: Also note there are other possible optimizations to the code, as jalf said in a comment to your question. Using this optimization indeed almost doubles the speed (0.453 seconds for X=Y=10000):

// an even faster way to access the array
for (int x = 0; x < X; x++) {
  int* arrayptr = array[x];
  for (int y = 0; y < Y; y++, arrayptr++)
    *arrayptr = x;
}

Code (note you can also use this to measure your own case, where the difference shouldn't be that extreme except for big X and Y; like others already said, measure it and you'll be enlightened):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define X 10000
#define Y 10000

int main() {

  int** array = new int*[X];

  for (int x = 0; x < X; x++) {
    array[x] = new int[Y];
  }

  double c = clock();  

  // 1 way to access the data
  for (int x = 0; x < X; x++)
    for(int y = 0; y < Y; y++)
      array[x][y] = x;

  printf("%f\n", (clock() - c) / CLOCKS_PER_SEC);

  c = clock();  

  // the other way to access the data
  for (int y = 0; y < Y; y++)
    for (int x = 0; x < X; x++)
      array[x][y] = x;

  printf("%f\n", (clock() - c) / CLOCKS_PER_SEC);

  for (int x = 0; x < X; x++) {
    delete[] array[x];  // arrays allocated with new[] need delete[]
  }
  delete[] array;
}
schnaader
Sure you'll notice a difference even if the array isn't "very big". Unless the entire array fits into a single cache line, you're going to get cache misses all over the place if you traverse it column-wise.
jalf
@jalf: You're right, I did a measurement with small values and corrected the answer.
schnaader