Let's consider two almost identical pieces of code:

First

for (int k = 0; k < 1000; k++)
{
    for (int i = 0; i < 600; i++)
    {
        for (int j = 0; j < 600; j++)
        {
            tab[i][j] = i * j;
        }
    }
}

Second

for (int k = 0; k < 1000; k++)
{
    for (int i = 0; i < 600; i++)
    {
        for (int j = 0; j < 600; j++)
        {
            tab[j][i] = i * j;
        }
    }
}

In the second one, instead of tab[i][j] we have tab[j][i].
The first code runs much faster.

Question
Why is the first code much faster?

My intuition
Is it because when a program tries to access a cell, the whole block that contains this cell is first moved into the cache, and the cell is then accessed through the cache? Since an array is laid out in memory as consecutive cells, the first case needs far fewer memory accesses than the second.
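
For reference, here is a self-contained benchmark in the same spirit (my own sketch: the flat std::vector, the steady_clock timing, and the final checksum that keeps the stores from being optimised away are all my assumptions, not part of the original code):

#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
    const int N = 600;
    std::vector<int> tab(N * N);   // one contiguous block, like int tab[600][600]

    auto t0 = std::chrono::steady_clock::now();
    for (int k = 0; k < 1000; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                tab[i * N + j] = i * j;   // first version: consecutive addresses
    auto t1 = std::chrono::steady_clock::now();
    for (int k = 0; k < 1000; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                tab[j * N + i] = i * j;   // second version: stride of N elements
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    printf("tab[i][j]: %lld ms\n", (long long)std::chrono::duration_cast<ms>(t1 - t0).count());
    printf("tab[j][i]: %lld ms\n", (long long)std::chrono::duration_cast<ms>(t2 - t1).count());
    printf("checksum: %d\n", tab[N * N - 1]);   // keep the stores observable
    return 0;
}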

+2  A: 

That's because of cache locality. A processor cache line can hold several array elements at once, but only from adjacent addresses.

In the first case you get more cache hits: when you iterate over the second array index, you access adjacent elements. You access some element, the processor loads it and its neighbours into a cache line, and the next adjacent accesses produce cache hits - you no longer need to access memory to serve them.

In the second case, when you iterate over the first index, you load some element and a cache line gets filled, but the next access is to an element that is not in the same line. This makes the processor load yet another line into the cache. If the cache can't hold all the lines at once, it has to discard previously loaded lines and reload them later. This greatly increases the number of memory accesses and thus the execution time.
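
You can see the layout directly by printing element addresses (a tiny standalone sketch of my own; the 4x4 size is arbitrary):

#include <cstdio>

int main()
{
    static int tab[4][4];
    printf("&tab[0][0] = %p\n", (void*)&tab[0][0]);
    printf("&tab[0][1] = %p\n", (void*)&tab[0][1]);   // next j: +sizeof(int) bytes
    printf("&tab[1][0] = %p\n", (void*)&tab[1][0]);   // next i: a whole row further
    return 0;
}

Stepping j moves sizeof(int) bytes, so consecutive j values tend to share a cache line; stepping i jumps a whole row, so consecutive i values usually don't.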

sharptooth
+1  A: 

Yes, your theory is correct.

When accessing single elements scattered all over the array, memory blocks have to be moved in and out of the cache repeatedly, because the entire array is too large to fit in the cache.

When you access the elements sequentially, each memory block only has to go into and out of the cache once. Also, as you are only writing to the most recently loaded block, the previous blocks can be written back to memory whenever it's most convenient.
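
To put rough numbers on that (my assumptions: 4-byte ints and 64-byte cache lines, which are common but not guaranteed, and that every strided store misses the cache):

#include <cstdio>

int main()
{
    const long long N = 600;
    const long long intsPerLine = 64 / 4;   // 16 elements fit in one cache line
    printf("sequential: ~%lld line fills per pass\n", N * N / intsPerLine);   // 22500
    printf("strided:    up to %lld line fills per pass\n", N * N);            // 360000
    return 0;
}

So the strided version can hit the memory system roughly 16 times as often per pass over the array.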

Guffa
+2  A: 

As well as the cache-locality issue correctly identified in the other answers, there is a secondary effect: most modern CPUs have automatic prefetching. When a certain number of cache lines have been loaded from sequential addresses, automatic prefetching kicks in and further cache lines are loaded speculatively. This can be a big performance win, since the effects of DRAM latency are largely hidden as a result. If you access memory non-sequentially you don't get this benefit, and prefetching may even be counter-productive if it loads cache lines that are not subsequently needed.
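
For illustration, GCC and Clang expose a manual counterpart to the hardware prefetcher, __builtin_prefetch. This sketch is mine: the prefetch distance of 16 elements is a guess rather than a tuned value, and the hint is redundant for sequential access since the hardware prefetcher already covers that case.

#include <vector>

// Fill an n-by-n row-major array stored in a flat vector.
void fill(std::vector<int>& tab, int n)
{
    int* p = tab.data();
    const int total = n * n;
    for (int idx = 0; idx < total; idx++)
    {
        // Hint the element ~16 slots ahead into the cache before it is needed,
        // mirroring what the hardware prefetcher does for sequential streams.
        if (idx + 16 < total)
            __builtin_prefetch(p + idx + 16);
        p[idx] = (idx / n) * (idx % n);   // same values as tab[i][j] = i * j
    }
}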

Paul R