I'm just learning OpenCL and I'm at the point of trying to launch a kernel. Why is it that the GPU threads are managed in a grid? I'm going to read more about this in detail, but it would be nice to have a simple explanation. Is it always like this when working with GPGPUs?
The simple answer is that GPUs are designed to process images and textures that are 2D grids of pixels. When you render a triangle in DirectX or OpenGL, the hardware rasterizes it into a grid of pixels.
This is a common approach, used in CUDA, OpenCL and, I think, ATI Stream.
The idea behind the grid is to provide a simple, but flexible, mapping between the data being processed and the threads doing the data processing. In the simple version of the GPGPU execution model, one GPU thread is "allocated" for each output element in a 1D, 2D or 3D grid of data. To process this output element, the thread will read one (or more) elements from the corresponding location or adjacent locations in the input data grid(s). By organizing the threads in a grid, it's easier for the threads to figure out which input data elements to read and where to store the output data elements.
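To make the "one thread per output element" idea concrete, here is a minimal CPU-side sketch in Python. It is not OpenCL code: `run_grid` and `blur_kernel` are made-up names for illustration, and the loop in `run_grid` only stands in for what the GPU would do in parallel. Each simulated thread gets its own grid position, reads the input at that position plus the adjacent ones, and writes exactly one output element.

```python
# CPU-side simulation of the "one thread per output element" model.
# run_grid and blur_kernel are hypothetical names, not part of any GPU API.

def blur_kernel(gid_x, gid_y, src, dst, width, height):
    """Body of one simulated thread: compute a single output element."""
    total = 0.0
    count = 0
    # Read the corresponding input element and its adjacent neighbours,
    # skipping positions that fall outside the data grid.
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            x, y = gid_x + dx, gid_y + dy
            if 0 <= x < width and 0 <= y < height:
                total += src[y][x]
                count += 1
    # Each thread writes only the element at its own grid position.
    dst[gid_y][gid_x] = total / count


def run_grid(kernel, global_size, *args):
    """Stand-in for a kernel launch: call the kernel once per grid position.

    A real GPU would run these invocations in parallel, in no guaranteed order.
    """
    size_x, size_y = global_size
    for gid_y in range(size_y):
        for gid_x in range(size_x):
            kernel(gid_x, gid_y, *args)


width, height = 4, 3
src = [[float(x + y * width) for x in range(width)] for y in range(height)]
dst = [[0.0] * width for _ in range(height)]
run_grid(blur_kernel, (width, height), src, dst, width, height)
```

Note that the kernel never has to work out which slice of the data is "its" slice; the grid position it is handed is the answer.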
This contrasts with the common multi-core, CPU threading model where one thread is allocated per CPU core and each thread processes many input and output elements (e.g. 1/4 of the data in a quad-core system).
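The difference between the two models can be sketched roughly as follows, again as plain sequential Python rather than real threads. The function names and the four-worker default are assumptions for illustration. In the CPU style, each worker first computes which chunk of the data it owns and then loops over it; in the GPU style, each thread handles the single element whose index it was given.

```python
# Sequential sketch contrasting the two work-splitting styles.
# All names here are hypothetical; no real threads or GPU are involved.

def chunk_bounds(n_items, n_workers, worker_id):
    """Return the half-open range [start, end) owned by one CPU-style worker."""
    base, extra = divmod(n_items, n_workers)
    # The first `extra` workers each take one additional element.
    start = worker_id * base + min(worker_id, extra)
    end = start + base + (1 if worker_id < extra else 0)
    return start, end


def cpu_style_square(data, n_workers=4):
    """Few workers, each processing a contiguous chunk of many elements."""
    out = [0] * len(data)
    for worker_id in range(n_workers):  # stands in for one thread per core
        start, end = chunk_bounds(len(data), n_workers, worker_id)
        for i in range(start, end):
            out[i] = data[i] * data[i]
    return out


def gpu_style_square(data):
    """Many threads, each processing exactly one element."""
    out = [0] * len(data)
    for gid in range(len(data)):  # stands in for one GPU thread per element
        out[gid] = data[gid] * data[gid]
    return out
```

Both functions produce the same result; what changes is who does the index bookkeeping. In the GPU style the launch grid does it for you.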
I will invoke the classic analogy of fitting a square peg into a round hole. In this case the GPU is a very square hole, and not as well rounded as the "GP" (general purpose) in GPGPU would suggest. The explanations above put forward the idea of 2D textures and so on. The architecture of the GPU is such that all processing is done in streams, with an identical pipeline in each stream, so the data being processed needs to be segmented in the same way.
One reason why this is a nice API is that you are typically working with an algorithm that has several nested loops. If you have one, two or three loops, then a grid of one, two or three dimensions maps nicely onto the problem, giving you one thread for each combination of index values.
So the values that you need in your kernel (the loop indices) are naturally expressed in the API.
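As a rough illustration of that mapping, here is a two-level nested loop and the same computation written kernel-style, simulated in Python. `launch` and `add_kernel` are hypothetical names; in real OpenCL the two indices would come from `get_global_id(0)` and `get_global_id(1)`, and which dimension maps to rows is simply a choice made here.

```python
from itertools import product

# Nested-loop version: the loop variables i and j are the indices.
def add_loops(a, b, rows, cols):
    c = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            c[i][j] = a[i][j] + b[i][j]
    return c


# Kernel-style version: the loops are gone and the indices arrive
# as the thread's position in the grid.
def add_kernel(global_id, a, b, c):
    i, j = global_id  # would be get_global_id(0), get_global_id(1) in OpenCL
    c[i][j] = a[i][j] + b[i][j]


def launch(kernel, global_size, *args):
    """Hypothetical stand-in for a launch: one call per point of the grid."""
    for global_id in product(*(range(n) for n in global_size)):
        kernel(global_id, *args)


a = [[1, 2, 3], [4, 5, 6]]
b = [[10, 20, 30], [40, 50, 60]]
c = [[0] * 3 for _ in range(2)]
launch(add_kernel, (2, 3), a, b, c)
```

The grid dimensions `(2, 3)` play the role of the two loop bounds, so the loop body becomes the kernel body more or less unchanged.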