Performing a scan in OpenCL

I've been trying to get a simple scan to work for quite some time now. For small problems, the output is correct, however for large output, I get the correct results only sometimes. I've checked Apple's OpenCL example and I am basically doing the same thing (except for the bank conflicts, which I'm ignoring atm). So here's the code for the first phase:

__kernel void
scan_init(__global int * input,
          __global int * sums)
{
  int gid = get_global_id(0);
  int lid = get_local_id(0);
  int chunk_size = get_local_size(0)*2;

  int chunk = gid/chunk_size;
  int offset = chunk*chunk_size;

  reduction(input, offset);

  // store sums
  if(lid==0)
  {
    sums[chunk] = input[(chunk+1)*chunk_size-1];
  }

  downsweep(input, offset);
}

And the reduction function itself:

void reduction(__global int * input,
      int offset)
{
 int stride = 1;
 int grp_size = get_local_size(0);
 int lid = get_local_id(0);

 for(int d = grp_size; d > 0; d>>=1)
 {
   barrier(CLK_GLOBAL_MEM_FENCE);

   if(lid < d)
   {
     int ai = stride*(2*lid+1)-1+offset;
     int bi = stride*(2*lid+2)-1+offset;
     input[bi] += input[ai];
   }

   stride *= 2;
  }
}

In the second phase, partials sums are used to build the sum for each element:

void downsweep(__global int * input,
        const unsigned int offset)
{
  int grp_size = get_local_size(0);
  int lid = get_local_id(0);
  int stride = grp_size*2;

  for(int d = 1; d <= grp_size; d *=2)
  {
    barrier(CLK_GLOBAL_MEM_FENCE);

    stride >>=1;

    if(lid+1 < d)
    {
      int src = 2*(lid + 1)*stride-1+offset;
      int dest = src + stride;
      input[dest]+=input[src];
    }
  }
}

The input is padded to a size that is a multiple of the local work size. Each work group can scan a chunk of twice it size. I save the sum of each chunk in the sums array, which I use to check the result. Following is the output for input size 4000 of an array of 1's:

Chunk size: 1024
Chunks: 4
Scan global size: 4096
Local work size: 512
Sum size: 4
0:1024  1:1120  2:2904  3:928

However, expected result would be

0:1024  1:1024  2:1024  3:928

If I run the code again, I get:

0:1056  1:5376  2:1024  3:928 
0:1024  1:1088  2:1280  3:992 
0:5944  1:11156 2:3662  3:1900  
0:7872  1:1056  2:2111  3:1248

The call to the kernel is the following:

clEnqueueNDRangeKernel(cl_ctx->queue, scan_init, 1, NULL, &scan_global_size, &local_work_size, 0, NULL, NULL);

Where global size is 4096 and local size is 512. If I limit the local work group size to 64, the output looks as follows:

0:128  1:128  2:128  3:288  4:128  5:128  6:192  7:192  
8:192  9:254  10:128  11:256  12:128  13:360  14:128  15:128  
16:128  17:128  18:128  19:288  20:128  21:128  22:128  23:128  
24:192  25:128  26:128  27:192  28:128  29:128  30:128  31:32

And if I change the input size to 512 and any chunks size, everything works great!

Finally, when using input size 513 and a group size of 256 (that is, I have two chunks, each having 512 elements, with the second one having only the first element set to 1), the result of the first phase is:

0:1  1:2  2:1  3:6  4:1  5:2  6:1  7:14  
8:1  9:2  10:1  11:6  12:1  13:2  14:1  15:28  
16:1  17:2  18:1  19:6  20:1  21:2  22:1  23:14  
24:1  25:2  26:1  27:6  28:1  29:2  30:1  31:56  
32:1  33:2  34:1  35:6  36:1  37:2  38:1  39:14  
40:1  41:2  42:1  43:6  44:1  45:2  46:1  47:28  
48:1  49:2  50:1  51:6  52:1  53:2  54:1  55:14  
56:1  57:2  58:1  59:6  60:1  61:2  62:1  63:148

Where it should be:

0:1  1:2  2:1  3:4  4:1  5:2  6:1  7:8  
8:1  9:2  10:1  11:4  12:1  13:2  14:1  15:16  
16:1  17:2  18:1  19:4  20:1  21:2  22:1  23:8  
24:1  25:2  26:1  27:4  28:1  29:2  30:1  31:32  
32:1  33:2  34:1  35:4  36:1  37:2  38:1  39:8  
40:1  41:2  42:1  43:4  44:1  45:2  46:1  47:16  
48:1  49:2  50:1  51:4  52:1  53:2  54:1  55:8  
56:1  57:2  58:1  59:4  60:1  61:2  62:1  63:64

My guess is, it's a problem with accessing same data concurrently by different threads, however, this shouldn't be the case, since every work group is processing a different chunk of the input data. Any help on this matter will be greatly appreciated!!

I suspect the problem has to do with barrier() not being an inter-workgroup synchronisation. Each workgroup will have its own barrier, and you have no guarantees about the ordering of workgroups themselves. When you changed the input set size to 512, you may get all your workgroups to run on the same multiprocessor, and therefore incidentally synchronized.

Your chunk variable is get_group_id(0)/2, which means you've got two entire workgroups assigned to the same chunk. You probably want that the other way around. If they happen to run in lockstep, they'll simply overwrite each other's work because their load-store dependencies will match up. Otherwise, they may or may not interfere, always in the direction of summing values multiple times.

A hint in this matter is in your question itself: "Each work group can scan a chunk of twice it size." That should mean a total work size of half the array size is sufficient.

The loop in downsweep() also has an oddity.The first iteration does nothing; lid+1>=1, and d starts as 1. That could be an insignificant superfluous iteration, but it's an off by one in the planning.

Well, now that you've pointed it out, it is pretty obvious! I have kind of forgotten about that detail, when calculating the global work size and the chunks. I've reworked it and now it's working perfectly fine! Thank you very much indeed, for spotting this.As of the 'oddity' in the downsweep-loop: yes, I am aware of the void loop-run, I'll fix it now that everything else is working.

VHristov 2010-07-23 11:02:22

ansaurus

tags:

views:

answers:

Performing a scan in OpenCL

related questions