views: 1480

answers: 7
I'm working with a program that uses 16-byte 4x4 one-byte matrices:

unsigned char matrix[4][4];

and a few 256-byte 16x16 one-byte matrices:

unsigned char bigMatrix[16][16];

Very often, due to the data manipulation, I am forced to loop column-wise through these matrices, causing cache misses.

Will performance improve if I use a one-dimensional array instead, i.e.

unsigned char matrix[16];
unsigned char matrix[256];

and access the elements by computing the index from some variables, i.e.

matrix[variableA*variableB + i];

where variableA*variableB+i needs to be recalculated every time I want to access an element.

I only want speed optimization, and memory is no problem. Will this help, as in give some performance gain or loss, or is the difference too small to care about?

+14  A: 

It makes no difference. The data is laid out in the exact same way in either case, and is accessed in the same way too. I'd be surprised if it didn't generate exactly the same assembly, even.

However, with a 256-byte table, you're unlikely to get many cache misses in any case. The CPU's L1 cache is typically between 32 and 128 KB, so the whole matrix fits comfortably.
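For illustration (this sketch is mine, not part of the answer): both declarations describe the same 16 contiguous bytes, so a flat pointer can alias the 2D array directly.

#include <cassert>

unsigned char matrix[4][4];

int main() {
    unsigned char *flat = &matrix[0][0];  // flat view of the same 16 bytes
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            assert(&matrix[row][col] == flat + row * 4 + col);
    return 0;
}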

jalf
+1 it's only a problem for much larger buffers
+1  A: 

You say that variableA*variableB+i needs to be recalculated every time you access an element, but this happens in any case, even with multidimensional arrays. The only difference is that with a multidimensional array the compiler emits this code for you, so you don't see it, while with a one-dimensional array the computation is visible in your source.
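To make that concrete, here's a minimal sketch (the function names are mine) of the two spellings of the same address arithmetic:

// The 2D subscript and the hand-written 1D index compute the same address.
unsigned char get2d(unsigned char m[4][4], int row, int col) {
    return m[row][col];       // compiler emits: *(&m[0][0] + row*4 + col)
}

unsigned char get1d(unsigned char m[16], int row, int col) {
    return m[row * 4 + col];  // the same arithmetic, written by hand
}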

Motti
A: 

The big linear array can be slightly faster if you access the array sequentially, because you save a multiply operation at every index. And if "looping column-wise" means the column index varies fastest, then in [row][col] notation (which has been "standard" with everyone I've ever spoken to) you are in fact accessing memory sequentially.

I doubt your 256-element array will cause cache misses on modern hardware, but I am willing to be proven wrong. What does cachegrind tell you?
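(For reference, cachegrind ships as a valgrind tool; the program name and the output-file suffix below are placeholders:)

valgrind --tool=cachegrind ./your_program
cg_annotate cachegrind.out.<pid>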

Adam Hawes
You can use pointers to iterate over either array in the same way.
strager
+1  A: 

While the compiled code will perform equally fast, there is a design issue: reuse of the indexing code could be maximized.

The best way to do so, imho, is wrapping the matrix in a container that knows how to loop over its elements in the fastest way. There's a name for this: an 'internal iterator', as mentioned in the GoF Design Patterns "Iterator" pattern.

A brief example:

 template< int N >
 struct CNxN {
     typedef int t_row[N];
     typedef t_row t_matrix[N];
     t_matrix m_Contents;

     template< typename Functor >
     void each( Functor & f ) {
         // row-major layout: keep the row loop outermost so the
         // elements are visited in sequential memory order
         for( int row = 0; row != N; ++row )
             for( int col = 0; col != N; ++col )
                 f( m_Contents[row][col] );
     }
 };

 // client code
 CNxN<3> matrix = { { {1,1,1},{1,1,1},{1,1,1} } };

 struct sum {
      long result;
      sum():result(0){}
      void operator()( int i ){ result += i; }
 };

 sum s;
 matrix.each( s );
 assert( s.result == 9 );  // nine elements, all 1
 // assert( has_performed_in_the_fastest_possible_way ); ;)
xtofl
Use pointers instead of indexing in your each function for a speed boost. May be optimized by compilers, though. Would be nice to provide stl iterators so stl algorithms can be used on your matrix (though some of them may not be useful).
strager
Indeed: stl-like iterators would be nicer. Matthew Wilson wrote a nice book about that (must read!). You're right about the pointers, too; I primarily wanted to show the reuse style.
xtofl
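For illustration, here is a sketch (mine, not from the thread) of the pointer-based traversal strager suggests, written as a drop-in body for CNxN::each above:

 // Walk the elements as one flat range: no per-element index math.
 template< typename Functor >
 void each( Functor & f ) {
     int *p   = &m_Contents[0][0];  // first element
     int *end = p + N * N;          // one past the last element
     for ( ; p != end; ++p )
         f( *p );
 }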
+3  A: 

jalf is mostly right. L1 cache is divided up into chunks; the size of the chunks is dependent on the processor, but is on the order of 32 bytes. So, if you were stepping through memory a byte at a time, you'd get a cache miss every 32 bytes (or whatever the chunk size is). Now, Intel chips are quite clever here and can detect sequential reads and prefetch the data, reducing the impact of a cache miss.

The 4x4 matrix is highly likely to reside in a single L1 chunk (or cache line), so accessing it by row or by column makes little difference. Of course, you don't want to split the matrix across two cache lines, so well-aligned memory is important.
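For illustration, C++11 lets you request such alignment explicitly; the 64-byte figure below is a common cache-line size, not a guarantee:

// Ask for cache-line alignment so the whole 4x4 matrix lands in one
// line (64 bytes is typical on current x86; check your target CPU).
alignas(64) unsigned char matrix[4][4];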

The 16x16 matrix, however, won't fit in a cache line. So, if you're skipping through the array processing columns, you're going to get lots of cache misses. The computation of the index, as jalf said, makes little difference, as the ratio of CPU speed to memory speed is high (i.e. you can do a lot of CPU work for each cache miss).

Now, if you are mainly processing the matrix in a column-oriented way, then your best option is to transpose all your matrices (swap rows with columns): your memory accesses will be more sequential, the number of cache misses will be reduced, and the CPU will be able to prefetch data better. So, instead of organising the matrix like this:

  0   1   2 .... 15
 16  17  18 .... 31
....
240 241 242 .... 255

where each number is the memory offset from the start of the matrix, organise it thus:

 0 16 32 ... 240
 1 17 33 ... 241
 ...
15 31 47 ... 255
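
A minimal sketch of such a transpose (the function name is mine); a square matrix can be transposed in place by swapping across the diagonal:

// In-place transpose of a square 16x16 matrix: swap [row][col]
// with [col][row] for every pair above the diagonal.
void transpose(unsigned char m[16][16]) {
    for (int row = 0; row < 16; ++row) {
        for (int col = row + 1; col < 16; ++col) {
            unsigned char tmp = m[row][col];
            m[row][col] = m[col][row];
            m[col][row] = tmp;
        }
    }
}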

Skizz
+1 I was hoping to see the 'tip' about transposing the matrix.
Dave Van den Eynde
You'd only get a cache miss once for each cache line though, and that's pretty much unavoidable. But once all the data has been touched, it'll fit easily inside the cache, so unless you suddenly pull in a lot more data, you won't get cache misses.
jalf
And the computation of the index makes no difference no matter the ratio of CPU speed to memory speed, because it's exactly the same thing in both cases. A 2D array is indexed in exactly the same way as he showed for a 1D array.
jalf
If you're only processing a few matrices then it probably makes no difference whether they're row- or column-ordered. If you're processing thousands of matrices, you should see better throughput with the transposed matrix, as the hardware prefetcher can spot linear memory access and fetch ahead better.
Skizz
A: 

When I was in school, one of my CS teachers insisted that a single-dimensional array would be faster. I was very annoyed that day...

Jace Jung
A: 

Very often, due to data manipulation, I am forced to loop column-wise [...]

You can't have it both ways: either row-wise or column-wise looping will lead to cache misses if the matrix is 'big enough' (see Skizz's answer). Optimise for the type of looping that gets executed most often.

If memory consumption is not an issue, you might also consider saving both the matrix and its transpose.
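As a hypothetical sketch of that idea (all names here are mine): store both orientations and mirror every write into the transposed copy, so each loop direction gets sequential access.

// Store the matrix twice, once per orientation, keeping them in sync.
struct DualMatrix {
    unsigned char byRow[16][16];  // row-major copy, for row-wise loops
    unsigned char byCol[16][16];  // transposed copy, for column-wise loops

    void set(int row, int col, unsigned char value) {
        byRow[row][col] = value;
        byCol[col][row] = value;  // mirror the write into the transpose
    }
};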

Christoph