I was reading Supercomputing for the Masses: Part 5 on Dr. Dobb's and I have a question about the author's code for (fast) array reversal.

I understand the need to use shared memory, but I don't understand where the performance gain in reverseArray_multiblock_fast.cu comes from.

In reverseArray_multiblock_fast.cu an array element is transferred from global memory to shared memory, and then from shared memory back to global memory. I cannot understand why this is better than reading the element directly from global memory and writing it to another index in global memory.

Could you please explain this to me?

+2  A: 

The answer is that GPUs like reading and writing aligned, consecutive blocks of data (in CUDA this is referred to as coalescing). In the fast implementation you read a block into shared memory in this ideal pattern, reverse it there, and only then write it out in consecutive order. Notice that the thread that does the reordering isn't necessarily the thread that writes the data back to global memory.
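
Roughly, the staged kernel looks like the sketch below. This is not the article's exact code, just a minimal illustration; it assumes the array length is a multiple of the block size and that the kernel is launched with blockDim.x * sizeof(int) bytes of dynamic shared memory.

    __global__ void reverseStaged(int *d_out, const int *d_in)
    {
        extern __shared__ int s_data[];

        // Coalesced read: thread t of this block reads element t.
        int in = blockDim.x * blockIdx.x + threadIdx.x;

        // The reordering happens here, in fast on-chip shared memory:
        // store the element at the mirrored position within the block.
        s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];

        __syncthreads();

        // Coalesced write: thread t writes element t of the mirrored block,
        // which is not the element it originally read.
        int out = blockDim.x * (gridDim.x - 1 - blockIdx.x) + threadIdx.x;
        d_out[out] = s_data[threadIdx.x];
    }

A launch would then look like reverseStaged<<<numBlocks, numThreads, numThreads * sizeof(int)>>>(d_out, d_in), so both the read and the write are contiguous within each half-warp.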

Eri
A: 

Check out Supercomputing for the Masses: Part 6; it explains everything...

scatman
A: 

You've raised an interesting point, because that article was written back in 2008.

On the original CUDA devices (Compute Capability 1.0 and 1.1) it was critical to access global memory using a "coalesced" pattern. This meant that if thread 0 accesses four bytes at byte address A, then threads 1-15 must access the consecutive addresses A+4 to A+60 respectively (*1).

The code in the article therefore gets threads 0-15 to read from contiguous, increasing addresses, store into shared memory, then read from shared memory in reverse order and write to contiguous, increasing addresses. As a result both the read from and the write to global memory conform to the strict coalescing requirements.

Since the article was written, however, newer CUDA devices have come out (Compute Capability 1.2 and 1.3, and now 2.0 and 2.1) which perform some degree of automatic coalescing. Specifically, in this case it is perfectly acceptable to read in one order and write in the reversed order: the hardware recognises that the write is a permutation of a coalesced write and simply reorders it for you.

So in summary, on a device with Compute Capability 1.2 or higher you don't need to stage via shared memory for this particular problem. Shared memory is still invaluable in many other problems, of course!
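
For illustration, on those newer devices a direct kernel along the following lines should coalesce well without any shared-memory staging (a sketch under that assumption, not a measured claim):

    __global__ void reverseDirect(int *d_out, const int *d_in, int n)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n)
        {
            // The read is coalesced; the write is a reversed permutation of a
            // coalesced access, which Compute Capability 1.2+ hardware can
            // service in the same number of memory transactions.
            d_out[n - 1 - i] = d_in[i];
        }
    }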

For more information you should check out the section on memory performance in the CUDA Best Practices Guide (available on the CUDA developer site) and also the advanced CUDA training sessions (for example, this recording).

*1: Note also that address A must be aligned to a 64-byte boundary.

Tom