Hey
My CUDA code needs to process 4 arrays (reducing each one to its mean/std and calculating a histogram), each 2048 floats long and already stored in device memory by previous kernels.
It is generally advised to launch at least as many blocks as there are multiprocessors. In this case, however, each of these arrays fits in the shared memory of a single block, so I could launch only 4 blocks.
This is far from 'keeping the GPU busy', but if I use more blocks I will need more inter-block communication via global memory, and I anticipate any extra utilisation of the multiprocessors will be in vain due to the extra time spent transferring data in and out of global memory.
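Concretely, here is a rough sketch of the one-block-per-array kernel I have in mind (the block size, number of histogram bins, value range, and output layout are placeholders I made up for illustration; d_arrays would be a device-side array holding the 4 existing device pointers):

#define N          2048   // length of each array
#define BLOCK_SIZE 512    // placeholder block size
#define NUM_BINS   64     // placeholder histogram size
#define HIST_MIN   0.0f   // placeholder value range
#define HIST_MAX   1.0f

// One block processes one array: data[blockIdx.x] points to 2048 floats
// already resident in global memory from the previous kernels.
__global__ void reduceAndHistogram(const float* const* data,
                                   float* mean, float* stddev, int* hist)
{
    __shared__ float s_data[N];           // whole 2048-float array (8 KB)
    __shared__ float s_sum[BLOCK_SIZE];
    __shared__ float s_sq[BLOCK_SIZE];
    __shared__ int   s_hist[NUM_BINS];

    const float* a   = data[blockIdx.x];
    const int    tid = threadIdx.x;

    // Stage this block's array into shared memory and zero the histogram.
    for (int i = tid; i < N; i += BLOCK_SIZE)
        s_data[i] = a[i];
    for (int i = tid; i < NUM_BINS; i += BLOCK_SIZE)
        s_hist[i] = 0;
    __syncthreads();

    // Per-thread partial sums and shared-memory histogram.
    float sum = 0.0f, sq = 0.0f;
    for (int i = tid; i < N; i += BLOCK_SIZE) {
        float v = s_data[i];
        sum += v;
        sq  += v * v;
        int bin = (int)((v - HIST_MIN) / (HIST_MAX - HIST_MIN) * NUM_BINS);
        bin = min(max(bin, 0), NUM_BINS - 1);
        atomicAdd(&s_hist[bin], 1);
    }
    s_sum[tid] = sum;
    s_sq[tid]  = sq;
    __syncthreads();

    // Tree reduction in shared memory for sum and sum of squares.
    for (int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            s_sum[tid] += s_sum[tid + stride];
            s_sq[tid]  += s_sq[tid + stride];
        }
        __syncthreads();
    }

    // Thread 0 writes mean/std; all threads help write the histogram.
    if (tid == 0) {
        float m = s_sum[0] / N;
        mean[blockIdx.x]   = m;
        stddev[blockIdx.x] = sqrtf(s_sq[0] / N - m * m);
    }
    for (int i = tid; i < NUM_BINS; i += BLOCK_SIZE)
        hist[blockIdx.x * NUM_BINS + i] = s_hist[i];
}

// Launch with one block per array:
// reduceAndHistogram<<<4, BLOCK_SIZE>>>(d_arrays, d_mean, d_std, d_hist);

So only 4 blocks are ever resident, and everything after the initial load works out of shared memory.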
Could anyone advise on the best way to parallelise in this kind of situation?
Thanks