As far as my understanding goes, shared memory is divided into banks, and accesses by multiple threads to data within the same bank will cause a conflict (or a broadcast, if they all access the same element).

At the moment I allocate a fairly large array which conceptually represents several pairs of two matrices:

__shared__ float A[34*N];

where N is the number of pairs; the first 16 floats of a pair are one matrix and the following 18 floats are the second.

The thing is, access to the first matrix is conflict-free, but access to the second one has conflicts. These conflicts are unavoidable; however, my thinking is that because the second matrix is 18 floats long, all subsequent pairs will be misaligned relative to the banks, and therefore more conflicts than necessary will occur.

Is this true, and if so, how can I avoid it?

Every time I allocate shared memory, does it start at a new bank? So could I potentially do:

__shared__ float Apair1[34];
__shared__ float Apair2[34];
...

Any ideas?

Thanks

+2  A: 

If your pairs of matrices are stored contiguously, and if you are accessing the elements linearly by thread index, then you will not have shared memory bank conflicts.

In other words if you have:

A[0]  <- mat1 element1
A[1]  <- mat1 element2
A[2]  <- mat1 element3
...
A[15] <- mat1 element16
A[16] <- mat2 element1
A[17] <- mat2 element2
...
A[33] <- mat2 element18

And you access this using:

float element;
element = A[pairindex * 34 + matindex * 16 + threadIdx.x];

Then adjacent threads are accessing adjacent elements in the matrix and you do not have conflicts.
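
To make that concrete, here is a minimal kernel sketch of this access pattern. The kernel name, the block shape (16 threads in x, one pair per thread row in y) and the NUM_PAIRS constant are illustrative assumptions, not code from the question, and handling the last two elements of the 18-float matrix is omitted:

#define NUM_PAIRS 16                 // hypothetical value for the question's N

// Launch as e.g.: accessPairs<<<1, dim3(16, NUM_PAIRS)>>>(out, in);
__global__ void accessPairs(float *out, const float *in)
{
    __shared__ float A[34 * NUM_PAIRS];

    int pairindex = threadIdx.y;     // each row of threads handles one pair
    int matindex  = 0;               // 0 = first (16-float) matrix, 1 = second

    // Threads 0..15 access 16 consecutive words, i.e. 16 distinct banks.
    int idx = pairindex * 34 + matindex * 16 + threadIdx.x;
    A[idx] = in[idx];                // contiguous store: conflict-free
    __syncthreads();

    out[idx] = A[idx] * 2.0f;        // contiguous load: conflict-free
}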

In response to your comments (below), it does seem that you are mistaken in your understanding. It is true that there are 16 banks (on current hardware; the next generation, Fermi, has 32), but consecutive 32-bit words reside in consecutive banks, i.e. the address space is interleaved across the banks. This means that provided you always have an array index that can be decomposed to x + threadIdx.x (where x does not depend on threadIdx.x, or at least is constant across groups of 16 threads), you will not have bank conflicts.

When you access the matrices further along the array, you still access them in a contiguous chunk and hence you will not have bank conflicts. It is only when you start accessing non-adjacent elements that you will have bank conflicts.
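
A quick host-side check makes the interleaving visible. This is a sketch assuming the 16-bank, 4-byte-word layout described above (word i lives in bank i % 16); the pair and matrix indices are arbitrary example values:

#include <stdio.h>

int main(void)
{
    int pairindex = 2, matindex = 1;           /* arbitrary example values */
    int base = pairindex * 34 + matindex * 16; /* start of this matrix     */

    /* Threads 0..15 of a half-warp touch 16 consecutive words, so the
       printed bank numbers are a permutation of 0..15: no conflicts,
       whatever the base offset of the matrix within the array. */
    for (int tid = 0; tid < 16; ++tid)
        printf("thread %2d -> word %3d -> bank %2d\n",
               tid, base + tid, (base + tid) % 16);
    return 0;
}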

The reduction sample in the SDK illustrates bank conflicts very well by building up from a naive implementation to an optimised one; it is worth taking a look.

Tom
Thanks. If I had just a single pair of matrices (actually these are matrix rows, as I am doing a QR decomposition using Givens rotations) then there would be no or few conflicts. The problem, I think, is that subsequent pairs of matrices will be misaligned relative to the shared memory banks. In other words, data that belongs to the second pair will not start at the start of a bank, and thus conflicts will occur.
zenna
Having said that, I think my understanding of banks was confused. I thought several 32-bit elements belonged to a single bank; now it seems every 32-bit element belongs to its own bank. But then I don't understand what the documentation means by 'there are 16 banks', as that would equate to a total of only 64 bytes of shared memory.
zenna
Updated my answer in response...
Tom