Hey

My CUDA code must process (reduce to mean/std, compute a histogram of) 4 arrays, each 2048 floats long and already stored in device memory from previous kernels.

It is generally advised to launch at least as many blocks as there are multiprocessors. In this case, however, I can fit each of these arrays into the shared memory of a single block, and therefore launch only 4 blocks.

This is far from 'keeping the GPU busy', but if I use more blocks I will need more inter-block communication via global memory, and I anticipate any extra utilisation of the multiprocessors will be in vain due to the extra time spent transferring data in and out of global memory.
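For concreteness, the approach described above (one block per array, reducing in shared memory) might be sketched as follows. The array layout (4 contiguous arrays of 2048 floats) and the block size of 256 are assumptions for illustration; 2048 floats is 8 KB, which fits comfortably in shared memory.

```cuda
// One block per 2048-float array, summed via a tree reduction in shared memory.
__global__ void blockReduceSum(const float *d_data, float *d_sums)
{
    __shared__ float s[2048];           // one whole array per block (8 KB)
    const int base = blockIdx.x * 2048; // block b owns array b

    // Cooperative load: each of the 256 threads loads 8 elements.
    for (int i = threadIdx.x; i < 2048; i += blockDim.x)
        s[i] = d_data[base + i];
    __syncthreads();

    // Standard tree reduction in shared memory.
    for (int stride = 1024; stride > 0; stride >>= 1) {
        for (int i = threadIdx.x; i < stride; i += blockDim.x)
            s[i] += s[i + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        d_sums[blockIdx.x] = s[0];      // mean = s[0] / 2048 on the host
}
// Launch: blockReduceSum<<<4, 256>>>(d_data, d_sums);
```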

Could anyone advise on the best way to parallelise in this kind of situation?

Thanks

A: 

Inter-block communication is not recommended in CUDA. Also, Fermi will support concurrent kernel execution, so high occupancy will become less important in the future. I would recommend leaving it at lower occupancy for now unless the performance is unacceptably low.

Eric
A: 

The amount of work you are doing is relatively small, so you should probably stick with four blocks. There is still an advantage to keeping the data local to the GPU for the previous/subsequent kernels.

Fermi will allow concurrent kernels, and it is exactly this case that stands to benefit the most, since you can start the next kernel to occupy the remaining SMs while this kernel is executing. However, this assumes there are no dependencies between the two kernels - naturally, you cannot start a kernel that depends on the result of a previous kernel before that kernel has finished.
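For independent kernels, the overlap would be expressed with streams. This is a hedged sketch; `kernelA`/`kernelB` and their arguments are placeholders, and actual concurrency depends on hardware support (Fermi or later) and resource availability:

```cuda
cudaStream_t sA, sB;
cudaStreamCreate(&sA);
cudaStreamCreate(&sB);

// The small 4-block reduction occupies only a few SMs...
kernelA<<<4, 256, 0, sA>>>(d_data, d_sums);
// ...so an independent kernel in another stream can fill the rest.
kernelB<<<60, 256, 0, sB>>>(d_other);

cudaStreamSynchronize(sA);
cudaStreamSynchronize(sB);
cudaStreamDestroy(sA);
cudaStreamDestroy(sB);
```

Kernels launched into the same stream still execute in order, which is how the dependency constraint above is expressed.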

Tom
A: 

I don't think you need all 2048 floats at once; if you have to reduce, you can split each array into parts and then merge the partial results at the end of the block execution. Can you show some sample code?
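The split-and-merge idea might look like this: stage 1 reduces each chunk of an array in its own block, stage 2 merges the partial sums with a second launch. Names and sizes are illustrative, and the block size must be a power of two for this reduction pattern:

```cuda
// Stage 1: each block reduces a 256-element chunk; stage 2 reduces the partials.
__global__ void partialSum(const float *in, float *out, int n)
{
    __shared__ float s[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // pad tail with zeros
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = s[0];
}
// Stage 1: partialSum<<<8, 256>>>(d_array, d_partial, 2048);
// Stage 2: partialSum<<<1, 8>>>(d_partial, d_total, 8);
```

This spreads one array over 8 blocks instead of 1, at the cost of an extra kernel launch and a round trip of the partial sums through global memory, which is exactly the trade-off the question raises.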

fabrizioM
A: 

This example shows how to compute all the "summary statistics" in a single reduction with Thrust. Another example shows how to compute a histogram using thrust::sort.
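The linked examples are not reproduced here; as a simplified sketch of the same idea, mean and standard deviation can be computed with two Thrust passes (the linked example fuses them into a single reduction with a custom functor):

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <cmath>

struct square
{
    __host__ __device__ float operator()(float x) const { return x * x; }
};

// Two-pass mean/std over a device vector; numerically cruder than the
// single-reduction version, but shows the API shape.
float mean_and_std(const thrust::device_vector<float>& d, float* std_out)
{
    const float n    = static_cast<float>(d.size());
    const float sum  = thrust::reduce(d.begin(), d.end(), 0.0f);
    const float sum2 = thrust::transform_reduce(d.begin(), d.end(),
                                                square(), 0.0f,
                                                thrust::plus<float>());
    const float mean = sum / n;
    *std_out = std::sqrt(sum2 / n - mean * mean);
    return mean;
}
```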

wnbell