tags:

views:

30

answers:

1

say I want to load an array of short from global memory to shared memory. I am not sure how coalescing works here. On best practice guide, it says on device of compute capability 1.0 or 1.1, the k-th thread in a half warp must access the k-th word in a segment aligned to 16 times the size of the elements being accessed.

If I understand it correctly, in case I break my data into 32bytes (16 shorts) segments, thread id 0, 16, 32 ... has to access first element of each segment? do i need to consider 64bytes alignment or 128 bytes alignment as well? I have a gts 250, so i guess this is important. Advices are welcomed. thanks.

+1  A: 

According to Section G.3.2.1 of the CUDA Programming Guide short will not coalesce on Compute Capability 1.0 and 1.1 devices under any circumstances. Specifically, it states:

The size of the words accessed by the threads must be 4, 8, or 16 bytes

You can however use vector types such as short2, short4, or even short8 to get coalesced access. The coalescing rules for these types is spelled out in Section G.3.2.1 as well. However, as far as coalescing is concerned a short2 is just like a 32-bit int.

FWIW, devices with Compute Capability 1.3 or greater handle types like char and short much better. Reading chars on a 1.3 device might give you as much as ~60% of peak memory bandwidth vs. ~10% of peak memory bandwidth on a 1.0 or 1.1 device.

wnbell