I've been playing with OpenCL recently, and I'm able to write simple kernels that use only global memory. Now I'd like to start using local memory, but I can't seem to figure out how to use get_local_size() and get_local_id() to compute one "chunk" of output at a time.

For example, let's say I wanted to convert Apple's OpenCL Hello World example kernel to something that uses local memory. How would you do it? Here's the original kernel source:

__kernel void square(
    __global float *input,
    __global float *output,
    const unsigned int count)
{
    int i = get_global_id(0);
    if (i < count)
        output[i] = input[i] * input[i];
}

If this example can't easily be converted into something that shows how to make use of local memory, any other simple example will do.

Thanks!

+4  A: 

Check out the samples in the NVIDIA or AMD SDKs; they should point you in the right direction. Matrix transpose, for example, uses local memory.

Using your squaring kernel, you could stage the data in an intermediate local-memory buffer. Remember to pass in the additional `__local` parameter from the host.

__kernel void square(
    __global float *input,
    __global float *output,
    __local float *temp,
    const unsigned int count)
{
    int gtid = get_global_id(0);
    int ltid = get_local_id(0);
    if (gtid < count)
    {
        temp[ltid] = input[gtid];
        output[gtid] =  temp[ltid] * temp[ltid];
    }
}
Tom
I've read through the NVIDIA introductory material, and I still find the examples too complex. I'm looking for an über-simple 1-dimensional example of using local memory to get my feet wet.
splicer
Thanks for adding code in your last edit! I can't seem to get your kernel working though.... How would I use clSetKernelArg() for temp? Do I need to use clCreateBuffer() for temp? Also, there are a few typos in your kernel: "temp * temp" should be "temp[ltid] * temp[ltid]", and a closing brace should be inserted before the last line.
splicer
Running on the CPU under Snow Leopard, I tried clSetKernelArg(kernel, 2, sizeof(cl_float), NULL); but it crashes. Any ideas?
splicer
I corrected the typos (serves me right for typing on an iPod). Your clSetKernelArg call is not allocating enough memory, though: you need space for one cl_float per thread, and you have only allocated one float. Try: `clSetKernelArg(kernel, 2, sizeof(cl_float) * local_work_size[0], NULL);` where `local_work_size[0]` is the work group size in dimension 0.
Tom
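The sizing in the comment above can be sketched in plain C. This is only an illustration: `local_float_bytes` and `padded_global_size` are hypothetical helper names, not OpenCL functions, and `queue` in the usage comment is an assumed command queue variable. The second helper reflects that OpenCL 1.x expects the global size to be a multiple of the work group size, which is why the kernel keeps its `gtid < count` guard.

```c
#include <stddef.h>

/* Bytes to request for a __local float buffer with one slot per work-item.
   Hypothetical helper; cl_float has the same size as float. */
size_t local_float_bytes(size_t local_size)
{
    return sizeof(float) * local_size;
}

/* Smallest multiple of local_size that is >= count.
   Hypothetical helper; assumes local_size > 0. */
size_t padded_global_size(size_t count, size_t local_size)
{
    return ((count + local_size - 1) / local_size) * local_size;
}

/* Intended use on the host (needs the OpenCL headers; error checks omitted):
 *
 *   size_t local  = local_work_size[0];
 *   size_t global = padded_global_size(count, local);
 *   clSetKernelArg(kernel, 2, local_float_bytes(local), NULL);
 *   clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
 *                          &global, &local, 0, NULL, NULL);
 */
```

Passing NULL as the argument value together with a non-zero size is how a `__local` pointer argument is given its storage; no clCreateBuffer() call is involved for `temp`.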
Thanks! Looks like you're missing a semicolon on line 11. On the CPU, get_local_size(0) returns 1 for me, so shouldn't my use of clSetKernelArg work? Is this a bug in Apple's implementation?
splicer
Note that you can declare variables as local with the qualifier `__local`. For example, you could do `__local float values[GROUP_SIZE];` then have each thread write `values[get_local_id(0)] = ...`. Local memory doesn't need to be reached via a pointer passed into the kernel.
Edward Luong
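To make the "one chunk at a time" indexing concrete, here is a rough CPU emulation in plain C of how a work group would stage its chunk through a fixed-size `values` array. It is not kernel code: `square_emulated` is a hypothetical name, the loops stand in for the work groups and work-items that OpenCL would run in parallel, and `GROUP_SIZE` is set to an arbitrary small value.

```c
#include <stddef.h>

#define GROUP_SIZE 4  /* stands in for get_local_size(0); value is arbitrary */

/* CPU emulation of square() staged through a per-group scratch array.
   count does not have to be a multiple of GROUP_SIZE. */
void square_emulated(const float *input, float *output, size_t count)
{
    size_t num_groups = (count + GROUP_SIZE - 1) / GROUP_SIZE;

    for (size_t group = 0; group < num_groups; ++group) {   /* get_group_id(0) */
        /* Plays the role of: __local float values[GROUP_SIZE]; */
        float values[GROUP_SIZE];

        /* Phase 1: each work-item copies one element into the scratch array. */
        for (size_t lid = 0; lid < GROUP_SIZE; ++lid) {      /* get_local_id(0) */
            size_t gid = group * GROUP_SIZE + lid;           /* get_global_id(0) */
            if (gid < count)
                values[lid] = input[gid];
        }

        /* A kernel needs barrier(CLK_LOCAL_MEM_FENCE) at this point only if
           a work-item goes on to read a slot written by another work-item. */

        /* Phase 2: each work-item squares its own slot and writes it out. */
        for (size_t lid = 0; lid < GROUP_SIZE; ++lid) {
            size_t gid = group * GROUP_SIZE + lid;
            if (gid < count)
                output[gid] = values[lid] * values[lid];
        }
    }
}
```

The relationship to take away is global id = group id * local size + local id (for a zero global offset). In the squaring kernel each work-item only touches its own slot, so the staging buys nothing in speed; it just shows the indexing.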