
Hi.

I'm just starting out learning OpenCL. I'm trying to get a feel for what performance gains to expect when moving functions/algorithms to the GPU.

The most basic kernel given in most tutorials is one that takes two arrays of numbers, sums the values at corresponding indexes, and writes the results to a third array, like so:

__kernel void 
add(__global float *a,
    __global float *b,
    __global float *answer)
{
    int gid = get_global_id(0);
    answer[gid] = a[gid] + b[gid];
}

__kernel void
sub(__global float* n,
    __global float* answer)
{
    int gid = get_global_id(0);
    answer[gid] = n[gid] - 2;
}

__kernel void
ranksort(__global const float *a,
         __global float *answer)
{
  int gid = get_global_id(0);
  int gSize = get_global_size(0);
  // Rank sort: count how many elements are smaller than a[gid],
  // then scatter a[gid] to that position in the output.
  int x = 0;
  for(int i = 0; i < gSize; i++){
    if(a[gid] > a[i]) x++;
  }
  answer[x] = a[gid];
}

I am assuming that you could never justify computing this on the GPU; the memory transfer would outweigh the time it would take to compute this on the CPU by orders of magnitude (I might be wrong about this, hence this question).

What I am wondering is: what would be the most trivial example where you would expect a significant speedup from using an OpenCL kernel instead of the CPU?

+2  A: 

It depends on the definition of trivial. In my opinion it would be the matrix-matrix product, since it has an O(n^3) compute to O(n^2) memory ratio. Algorithms which exhibit similar ratios are likely to benefit from being computed on the GPU.
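A naive version of such a kernel might look roughly like this (an illustrative sketch only, not a tuned implementation; the kernel name, argument names, and row-major layout are assumptions):

// Naive n x n matrix-matrix product C = A * B, row-major storage.
// Each work-item computes one element of C with O(n) arithmetic,
// so the whole product does O(n^3) work on O(n^2) data.
__kernel void
matmul(__global const float *A,
       __global const float *B,
       __global float *C,
       const int n)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    for (int k = 0; k < n; k++)
        sum += A[row * n + k] * B[k * n + col];
    C[row * n + col] = sum;
}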

aaa
Matrix multiplication is far from a trivial application and not the best example of an application that can benefit from GPU computation.
Eric
A: 

After matrix multiplication, I would say image convolution (such as blur, denoising, etc.). Check out AMD's tutorial.
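A simple blur of this kind could be sketched like so (illustrative only; the image layout, kernel name, and edge handling are assumptions, and this is not taken from AMD's tutorial):

// 3x3 box blur over a single-channel float image stored row-major.
// Each work-item averages a pixel with its eight neighbours,
// clamping coordinates at the image border.
__kernel void
box_blur(__global const float *in,
         __global float *out,
         const int width,
         const int height)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    float sum = 0.0f;
    for (int dy = -1; dy <= 1; dy++) {
        for (int dx = -1; dx <= 1; dx++) {
            int xx = clamp(x + dx, 0, width - 1);
            int yy = clamp(y + dy, 0, height - 1);
            sum += in[yy * width + xx];
        }
    }
    out[y * width + x] = sum / 9.0f;
}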

Stringer Bell
+1  A: 

While your kernel is clearly very trivial, it can be a useful example: it is completely memory bound, since for every element you have two reads and one write, and only one arithmetic operation. There are some instructions to compute the addresses etc., but all of this amounts to practically nothing compared with the cost of accessing memory.

Assuming the data is already on the GPU, you can benefit from the GPU's very high memory bandwidth even for this simple kernel.

Of course, GPUs rely on you having enough threads to hide the memory latency, so your local work-group size should be fairly large (say 256 or 512) and your global work size should be very large (e.g. hundreds of thousands of work-items) for this to be effective, but that's kind of the point!
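On the host side, that sizing might look roughly like this for the add kernel above (a sketch; the queue, kernel, buffers, and kernel arguments are assumed to have been set up already):

/* Assumes `queue` and `kernel` are already created and the kernel
   arguments (the three buffers) have been set. Check err in real code. */
size_t global = 1 << 20;   /* ~1M work-items: one per array element        */
size_t local  = 256;       /* work-group size large enough to hide latency */
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                    &global, &local, 0, NULL, NULL);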

Tom
+2  A: 

If you have a sufficiently large set of matrices that you intend to perform linear algebra operations on, or where you're essentially performing the same operation on each element, I would regard that as a trivial example: matrix multiplication, addition, FFTs, convolution, etc. You'll see a bit of a speedup without doing much work. Now, if you want to see the 100x speedups, then you need to delve into memory management and know a fair bit about what's going on behind the scenes.

To get started, I would recommend PyCUDA, since it provides a very high level of abstraction and will allow you to jump in very quickly. When you are ready to dive in further, check out this course on parallel computing using CUDA from the University of Illinois: http://courses.ece.illinois.edu/ece498/al/

Rayjan
A: 

What is "most trivial" is a matter of opinion, but I would say that computing an image of the Mandelbrot set is a pretty straightforward application using the GPU. Each point is totally independent of every other point, so you can start up a thread for each point and get tremendous speedup. The formula itself that is iterated is a simple quadratic function. I used it as an example in a tutorial that can be found on my blog here, just computing the numbers without even making an image to make it even simpler. Almost any embarrassingly parallel (see Wikipedia entry) problem is a good one to begin with.