views:

52

answers:

2

Assuming a block has a limit of 512 threads, and my kernel needs more than 512 threads to execute, how should one design the thread hierarchy for optimal performance?

(case 1) 1st block: 512 threads; 2nd block: the remaining threads

(case 2) distribute an equal number of threads across a certain number of blocks.

+1  A: 

I don't think it really matters; it is more important to group the thread blocks logically, so that you can use other CUDA optimizations (like memory coalescing).

This link provides some insight into how CUDA will (likely) organize your threads.

A quote from the summary:

To summarize, special parameters at a kernel launch define the dimensions of a grid and its blocks. Unique coordinates in the blockIdx and threadIdx variables allow threads of a grid to distinguish among themselves. It is the programmer's responsibility to use these variables in the kernel functions so that the threads can properly identify the portion of the data to process. These variables compel the programmers to organize threads and their data into hierarchical and multi-dimensional organizations.

KLee1
A: 

It is preferable to divide the threads equally between the two blocks, in order to maximize the computation / memory-access overlap. When you have, for instance, 256 threads in a block, they do not all compute at the same time; they are scheduled on the SM in warps of 32 threads. When one warp is waiting on a global memory access, another warp is scheduled. If you have a small block of threads, your global memory accesses are much more penalizing.

Furthermore, in your example you underuse your GPU. Just remember that a GPU has dozens of multiprocessors (e.g. 30 for the Tesla C1060), and a block is mapped to a multiprocessor. In your case, you will use only 2 multiprocessors.

Jérôme