Hey
My CUDA code needs to process 4 arrays (reducing each one to its mean/std and calculating a histogram), each 2048 floats long and already stored in device memory by previous kernels.
It is generally advised to launch at least as many blocks as there are multiprocessors. In this case, however, each of these arrays fits in the shared memory of a single block, so I could launch only 4 blocks.
This is far from 'keeping the GPU busy', but if I use more blocks I will need more inter-block communication via global memory, and I anticipate any extra utilisation of the multiprocessors will be in vain due to the extra time spent transferring data in and out of global memory.
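Concretely, here is a rough sketch of the one-block-per-array kernel I have in mind (the block size, number of histogram bins, value range, and output layout are placeholders I made up for illustration; d_arrays would be a device-side array holding the 4 existing device pointers):

#define N          2048   // length of each array
#define BLOCK_SIZE 512    // placeholder block size
#define NUM_BINS   64     // placeholder histogram size
#define HIST_MIN   0.0f   // placeholder value range
#define HIST_MAX   1.0f

// One block processes one array: data[blockIdx.x] points to 2048 floats
// already resident in global memory from the previous kernels.
__global__ void reduceAndHistogram(const float* const* data,
                                   float* mean, float* stddev, int* hist)
{
    __shared__ float s_data[N];           // whole 2048-float array (8 KB)
    __shared__ float s_sum[BLOCK_SIZE];
    __shared__ float s_sq[BLOCK_SIZE];
    __shared__ int   s_hist[NUM_BINS];

    const float* a   = data[blockIdx.x];
    const int    tid = threadIdx.x;

    // Stage this block's array into shared memory and zero the histogram.
    for (int i = tid; i < N; i += BLOCK_SIZE)
        s_data[i] = a[i];
    for (int i = tid; i < NUM_BINS; i += BLOCK_SIZE)
        s_hist[i] = 0;
    __syncthreads();

    // Per-thread partial sums and shared-memory histogram.
    float sum = 0.0f, sq = 0.0f;
    for (int i = tid; i < N; i += BLOCK_SIZE) {
        float v = s_data[i];
        sum += v;
        sq  += v * v;
        int bin = (int)((v - HIST_MIN) / (HIST_MAX - HIST_MIN) * NUM_BINS);
        bin = min(max(bin, 0), NUM_BINS - 1);
        atomicAdd(&s_hist[bin], 1);
    }
    s_sum[tid] = sum;
    s_sq[tid]  = sq;
    __syncthreads();

    // Tree reduction in shared memory for sum and sum of squares.
    for (int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            s_sum[tid] += s_sum[tid + stride];
            s_sq[tid]  += s_sq[tid + stride];
        }
        __syncthreads();
    }

    // Thread 0 writes mean/std; all threads help write the histogram.
    if (tid == 0) {
        float m = s_sum[0] / N;
        mean[blockIdx.x]   = m;
        stddev[blockIdx.x] = sqrtf(s_sq[0] / N - m * m);
    }
    for (int i = tid; i < NUM_BINS; i += BLOCK_SIZE)
        hist[blockIdx.x * NUM_BINS + i] = s_hist[i];
}

// Launch with one block per array:
// reduceAndHistogram<<<4, BLOCK_SIZE>>>(d_arrays, d_mean, d_std, d_hist);

So only 4 blocks are ever resident, and everything after the initial load works out of shared memory.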
Could anyone advise on the best way to parallelise in this kind of situation?
Thanks