What is the relationship between a CUDA core, a streaming multiprocessor and the CUDA model of blocks and threads?

What gets mapped to what, and what is parallelized and how? And which is more efficient: maximizing the number of blocks or the number of threads?

Thanks,

ExtremeCoder


My current understanding is that there are 8 CUDA cores per multiprocessor, that each CUDA core executes one CUDA block at a time, and that all the threads in that block are executed serially on that particular core.

Is this correct?

+5  A: 

The thread / block layout is described in detail in the CUDA programming guide. In particular, chapter 4 states:

The CUDA architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.
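
To make the block-to-multiprocessor mapping in that passage concrete, here is a minimal sketch (my own illustration, not from the guide). The kernel name record_ids, the helper blocks_needed and the sizes are assumptions chosen for the example. Each thread writes down which block it landed in and which warp of that block it belongs to; which SM a block actually runs on is decided by the hardware and is not visible here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks needed to cover n elements
// (integer ceiling division).
int blocks_needed(int n, int threads_per_block)
{
    return (n + threads_per_block - 1) / threads_per_block;
}

// Each thread records the block it belongs to and its warp within that block.
__global__ void record_ids(int *block_of, int *warp_of, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {                                    // guard the partial last block
        block_of[i] = blockIdx.x;
        warp_of[i]  = threadIdx.x / warpSize;       // warpSize is a built-in variable
    }
}

int main()
{
    const int n = 200;                 // illustrative problem size
    const int threads_per_block = 64;  // two warps per block when warpSize is 32
    const int blocks = blocks_needed(n, threads_per_block);

    int *d_block = nullptr, *d_warp = nullptr;
    cudaMalloc(&d_block, n * sizeof(int));
    cudaMalloc(&d_warp, n * sizeof(int));

    // The grid of blocks is handed to the device; the hardware distributes
    // the blocks over whichever multiprocessors have capacity.
    record_ids<<<blocks, threads_per_block>>>(d_block, d_warp, n);
    if (cudaGetLastError() != cudaSuccess) {
        fprintf(stderr, "kernel launch failed\n");
        return 1;
    }

    int h_block[n], h_warp[n];
    if (cudaMemcpy(h_block, d_block, n * sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess ||
        cudaMemcpy(h_warp, d_warp, n * sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess) {
        fprintf(stderr, "copy back failed\n");
        return 1;
    }

    const int samples[] = {0, 31, 32, 64, 199};
    for (int i : samples)
        printf("element %3d -> block %d, warp %d\n", i, h_block[i], h_warp[i]);

    cudaFree(d_block);
    cudaFree(d_warp);
    return 0;
}
```

Note that the code never mentions cores or multiprocessors: you choose the grid and block sizes, and the scheduling onto SMs follows from that.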

On compute capability 1.x devices, each SM contains 8 CUDA cores, and at any one time they are executing a single warp of 32 threads, so it takes 4 clock cycles (32 threads / 8 cores) to execute a single instruction for the whole warp. On these devices you can assume that threads in any given warp execute in lock-step, but to synchronise across the warps of a block you need to use __syncthreads().
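
As a rough illustration of synchronising across warps, here is a sketch of a per-block sum in shared memory. The names block_sum, run_block_sum and THREADS are hypothetical and the sizes are arbitrary. It assumes the block size is a power of two and the input length is an exact multiple of it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS 128  // four warps per block when the warp size is 32

// Sums THREADS consecutive inputs per block into out[blockIdx.x].
__global__ void block_sum(const int *in, int *out)
{
    __shared__ int scratch[THREADS];
    int t = threadIdx.x;

    scratch[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();  // wait until every warp of the block has stored its values

    // Tree reduction: halve the number of active threads at each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride)
            scratch[t] += scratch[t + stride];
        __syncthreads();  // outside the if, so every thread of the block reaches it
    }

    if (t == 0)
        out[blockIdx.x] = scratch[0];
}

// Hypothetical host wrapper: copies the input over, runs the kernel,
// copies the per-block sums back. Returns false on any CUDA error.
bool run_block_sum(const int *h_in, int *h_out, int blocks)
{
    int *d_in = nullptr, *d_out = nullptr;
    size_t in_bytes = (size_t)blocks * THREADS * sizeof(int);
    size_t out_bytes = (size_t)blocks * sizeof(int);

    bool ok = cudaMalloc(&d_in, in_bytes) == cudaSuccess &&
              cudaMalloc(&d_out, out_bytes) == cudaSuccess &&
              cudaMemcpy(d_in, h_in, in_bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        block_sum<<<blocks, THREADS>>>(d_in, d_out);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(h_out, d_out, out_bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);   // cudaFree on a null pointer is a no-op
    cudaFree(d_out);
    return ok;
}

int main()
{
    const int blocks = 2, n = blocks * THREADS;
    int h_in[n], h_out[blocks];
    for (int i = 0; i < n; ++i) h_in[i] = 1;

    if (!run_block_sum(h_in, h_out, blocks)) {
        fprintf(stderr, "CUDA error\n");
        return 1;
    }
    for (int b = 0; b < blocks; ++b)
        printf("block %d sum = %d\n", b, h_out[b]);  // 128 ones per block, so 128
    return 0;
}
```

The sketch deliberately calls __syncthreads() at every step, including the last ones where only a single warp is still active, rather than relying on lock-step execution inside a warp. That costs a little, but it stays correct on architectures where lock-step within a warp is not guaranteed.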

Edric
Just one addition: on newer devices there are 32 (Compute Capability 2.0) or 48 (2.1) CUDA cores per SM. The actual number doesn't make much difference to programming: the warp size is still 32 and has the same meaning (i.e. the threads of a warp execute in lock-step).
Tom
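
If you want to see what your own device reports, a small sketch along these lines queries the runtime. It assumes device 0 is the one of interest, and warps_per_block is a hypothetical helper added for the example. As far as I know, cudaDeviceProp has no field for the number of cores per SM; that has to be looked up from the compute capability.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: how many warps a block of the given size occupies
// (a partially filled last warp still counts as a whole warp).
int warps_per_block(int threads_per_block, int warp_size)
{
    return (threads_per_block + warp_size - 1) / warp_size;
}

int main()
{
    int device = 0;  // assumption: the first device is the one of interest
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        fprintf(stderr, "no usable CUDA device\n");
        return 1;
    }

    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    printf("multiprocessors (SMs): %d\n", prop.multiProcessorCount);
    printf("warp size:             %d\n", prop.warpSize);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);

    // Example: a 100-thread block is scheduled as whole warps.
    printf("a 100-thread block occupies %d warps\n",
           warps_per_block(100, prop.warpSize));
    return 0;
}
```

Choosing block sizes that are multiples of the reported warp size avoids leaving part of the last warp idle.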