As far as my understanding goes, shared memory is divided into banks, and accesses by multiple threads to data within the same bank will cause a conflict (or a broadcast, if they all access the same element).

At the moment I allocate a fairly large array which conceptually represents several pairs of two matrices:

__shared__ float A[34*N];

where N is the number of pairs; the first 16 floats of a pair are one matrix and the following 18 floats are the second.

The thing is, access to the first matrix is conflict-free, but access to the second one has conflicts. These conflicts are unavoidable; however, my thinking is that because the second matrix is 18 floats long, all subsequent pairs will be misaligned relative to the banks, and therefore more conflicts than necessary will occur.

Is this true, and if so, how can I avoid it?

Every time I allocate shared memory, does it start at a new bank? So could I potentially do:

__shared__ float Apair1[34];
__shared__ float Apair2[34];
...

Any ideas?

Thanks

+2  A: 

If your pairs of matrices are stored contiguously, and if you are accessing the elements linearly by thread index, then you will not have shared memory bank conflicts.

In other words if you have:

A[0]  <- mat1 element1
A[1]  <- mat1 element2
A[2]  <- mat1 element3
...
A[15] <- mat1 element16
A[16] <- mat2 element1
A[17] <- mat2 element2
...
A[33] <- mat2 element18

And you access this using:

float element;
element = A[pairindex * 34 + matindex * 16 + threadIdx.x];

Then adjacent threads are accessing adjacent elements in the matrix and you do not have conflicts.
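
To make that concrete, here is a minimal kernel sketch of this access pattern. The kernel name, the block shape (16 threads in x, one pair per thread row in y) and the NUM_PAIRS constant are illustrative assumptions, not code from the question, and handling the last two elements of the 18-float matrix is omitted:

#define NUM_PAIRS 16                 // hypothetical value for the question's N

// Launch as e.g.: accessPairs<<<1, dim3(16, NUM_PAIRS)>>>(out, in);
__global__ void accessPairs(float *out, const float *in)
{
    __shared__ float A[34 * NUM_PAIRS];

    int pairindex = threadIdx.y;     // each row of threads handles one pair
    int matindex  = 0;               // 0 = first (16-float) matrix, 1 = second

    // Threads 0..15 access 16 consecutive words, i.e. 16 distinct banks.
    int idx = pairindex * 34 + matindex * 16 + threadIdx.x;
    A[idx] = in[idx];                // contiguous store: conflict-free
    __syncthreads();

    out[idx] = A[idx] * 2.0f;        // contiguous load: conflict-free
}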

In response to your comments (below), it does seem that you are mistaken in your understanding. It is true that there are 16 banks (on current hardware; the next generation, Fermi, has 32), but consecutive 32-bit words reside in consecutive banks, i.e. the address space is interleaved across the banks. This means that provided you always have an array index that can be decomposed to x + threadIdx.x (where x does not depend on threadIdx.x, or at least is constant across groups of 16 threads), you will not have bank conflicts.

When you access the matrices further along the array, you still access them in a contiguous chunk and hence you will not have bank conflicts. It is only when you start accessing non-adjacent elements that you will have bank conflicts.
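
A quick host-side check makes the interleaving visible. This is a sketch assuming the 16-bank, 4-byte-word layout described above (word i lives in bank i % 16); the pair and matrix indices are arbitrary example values:

#include <stdio.h>

int main(void)
{
    int pairindex = 2, matindex = 1;           /* arbitrary example values */
    int base = pairindex * 34 + matindex * 16; /* start of this matrix     */

    /* Threads 0..15 of a half-warp touch 16 consecutive words, so the
       printed bank numbers are a permutation of 0..15: no conflicts,
       whatever the base offset of the matrix within the array. */
    for (int tid = 0; tid < 16; ++tid)
        printf("thread %2d -> word %3d -> bank %2d\n",
               tid, base + tid, (base + tid) % 16);
    return 0;
}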

The reduction sample in the SDK illustrates bank conflicts very well by building up from a naive implementation to an optimised one; it is worth taking a look.

Tom
Thanks. If I had just a single pair of matrices (actually these are matrix rows, as I am doing a QR decomposition using Givens rotations) then there would be no or few conflicts. The problem, I think, is that subsequent pairs of matrices will be misaligned relative to the shared memory banks. In other words, data that belongs to the second pair will not start at the start of a bank, and thus conflicts will occur.
zenna
Having said that, I think my understanding of banks was confused. I thought several 32-bit elements belonged to a single bank; now it seems every 32-bit element belongs to its own bank. But then I don't understand what the documentation means by 'there are 16 banks', as that would equate to a total of only 64 bytes of shared memory.
zenna
Updated my answer in response...
Tom