tags:

views:

517

answers:

3

I am doing research about GPU programming and want to learn more about CUDA. I read a lot about it (from Wikipedia, NVIDIA, and other references) but I still have some questions:

1 - Is the following true: a GPU has multiprocessors, every multiprocessor has streaming processors, and every streaming processor can run blocks of threads at the same time?

2 - All references state that the minimum number of threads to create inside one block is 32. Why is that?

3 - I have an ATI Radeon video card, and I was able to compile a simple CUDA program without emulation mode! I thought that I could only compile and run CUDA programs on supported NVIDIA VGAs. Can someone please explain?

+4  A: 

1 - This is true of NVIDIA GPUs.

2 - This is a constraint of the hardware design.

3 - Compilation is done on the CPU, so you could compile your program much like you could cross-compile for PPC on an x86.

If you want to run gpu programs on an ATI card, I suggest you look at OpenCL or AMD Stream.

goger
this is a constraint of the hardware design. So what happens if I created 2 blocks, each with only 10 threads? I have an ATI card and I was able to compile and "run" code written in CUDA and C without using -deviceemu. How can this work? I also have one question: what is the difference between GPU threads and CUDA threads? Is this answer correct: a CPU runs threads sequentially unless it has more than 1 core, so a dual-core CPU can only run 2 threads simultaneously. A GPU runs a block of threads in parallel because it has many streaming processors.
scatman
> 2 blocks / 10 threads: these will run inefficiently. You just can't run CUDA on ATI. If you want to run CUDA, you need an NVIDIA card. You characterized CPU/GPU threads correctly.
goger
(1) Each SM executes a block, not each SP. (2) This is not a constraint - the minimum number of threads per block is one, your program in this case would function correctly, however for efficient utilisation you should have a multiple of 32 threads per block (and for maximum efficiency you would need reasonable "occupancy" to hide memory latency, 50% is reasonable on GT200 processors). (3) The code is actually executing in emulation mode.
Tom
+3  A: 

A CUDA thread is very lightweight and can be scheduled/stalled with very little penalty. This is unlike a CPU thread which has a lot of overhead to switch in and out of execution. As a result, CPUs are great for task parallelism and GPUs will excel at data parallelism.

  1. In the CUDA architecture a (NVIDIA) GPU has "Streaming Multiprocessors" (SMs), each of which will execute a block of threads. Each SM has a set of Stream Processors (SPs), each of which will be executing instructions for one thread at any given moment (cycle).

  2. Actually the minimum number of threads inside a block is one. If you have just one thread per block, your code will execute correctly. However, it is far more efficient to set up a block such that it has a multiple of 32 threads. This is due to the way the hardware schedules operations across a "warp" which is 32 threads.

  3. You can cross compile your program. You could run it in emulation mode, i.e. the CPU is "emulating" a CUDA GPU, but to run on hardware you would need an NVIDIA GPU (CUDA enabled, anything recent, post 2006 or so, will do).

A high-end current-generation GPU has 240 cores (SPs). You could consider this as executing 240 threads at any given moment, but it is useful to think of the GPU as executing thousands of threads simultaneously, since the state (context) for multiple threads is kept loaded.

I think it is important to recognise that there are differences between CPU threads and GPU threads. They do have the same name but a GPU thread is lightweight and typically operates on a small subset of the data. Maybe it will help to think of a (set of) CPU thread(s) doing the non-parallel work, then each CPU thread forks into thousands of GPU threads for the data parallel work, then they join back to the CPU thread. Clearly if you can get the CPU thread to do work at the same time as the GPU then that will be even better.

Remember that, unlike a CPU, a GPU is a throughput architecture which means that instead of caches to hide latency, the program should create many threads so that while some threads are waiting for data to return from memory other threads can be executing. I'd recommend watching the "Advanced C for CUDA" talk from the GPU Technology Conference for more information.

Tom
So the number of blocks (grid size) should be equal to the number of SMs, while the number of threads should be equal to the number of SPs, for best performance. Is this right?
scatman
Not quite. The number of threads in a block should be a multiple of the warp size, which is 32 (the number of SPs is 8); ideally this would be more like 128, but this depends on your application. In general, especially when starting out, the number of blocks should be in the hundreds (or thousands). This is because the hardware can schedule multiple blocks on one SM if the resources are available, which in turn means that there are more threads running on the SM. It also means your code will scale smoothly across different devices, now and in the future.
Tom
The link you provided was very helpful, thank you. I still have 1 question: does every SP in an SM have its own register and local memory in hardware, or is there 1 register file and 1 local memory that is divided among the SPs when a kernel is launched? And the same question for the shared memory: is there 1 shared memory in every SM, or is there 1 shared memory for the whole GPU that is logically divided when the kernel is launched?
scatman
Each SM has a register file and a shared memory, these are discrete units. The register file is logically divided between _threads_ (rather than SPs), and the shared memory is logically divided between _blocks_ running on the SM. The local memory is a little different since it is mostly used for register file spill/fill; it actually lives in the off-chip memory but is reserved (and protected) per-thread.
Tom
+1  A: 
  1. Yes. Every GPU is an array of vector processors, or SIMD (Single-Instruction Multiple Data) processors. Within a single vector of threads -- which can be 32, 64, or some other number depending on the GPU -- each thread executes the same instruction of your kernel in lock step. This basic unit is sometimes called a "warp" or a "wavefront" or sometimes "a SIMD".

    32 seems to be typical for NVIDIA chips, 64 for ATI. IIRC, the number for Intel's Larrabee chip is supposed to be even higher, if that chip is ever manufactured.

  2. At the hardware level, threads are executed in these units, but the programming model lets you have an arbitrary number of threads. If your hardware implements a 32-wide wavefront and your program only requests 1 thread, 31/32 of that hardware unit will sit idle. So creating threads in multiples of 32 (or whatever) is the most efficient way to do things (assuming you can program it so that all the threads do useful work).

    What actually happens in the hardware is that there is at least one bit for each thread that indicates whether the thread is "alive" or not. The extra unused threads in a wavefront of 32 will actually be doing calculations, but they will not be able to write any of the results to any memory location, so it's just as if they never executed.

    When a GPU is rendering graphics for some game, each thread is computing a single pixel (or a sub-pixel if anti-aliasing is turned on), and each triangle being rendered can have an arbitrary number of pixels, right? If the GPU could only render triangles that contained an exact multiple of 32 pixels, it wouldn't work very well.

  3. goger's answer says it all.

  4. Although you didn't specifically ask, it's also very important for your GPU kernels to avoid branches. Since all 32 threads in a wavefront have to execute the same instruction at the same time, what happens when there's an if .. then .. else in the code? If some of the threads in the warp want to execute the "then" part and some want to execute the "else" part? The answer is that all 32 threads execute both parts! Which will obviously take twice as long, so your kernel will run at half speed.

Die in Sente
thanks for the additional point:)
scatman