OpenCL doesn't have a global barrier that will stop all threads, so I'm trying to create a workaround with the following code:

void barrier(__global uint* scratch) {
  uint nThreads = get_global_size(0);
  atom_inc(scratch);
  /* this loop never terminates */
  while(scratch[0] < nThreads) {
    continue;
  }
}

The idea is that each thread loops until all of them increment that one piece of memory.

However, the value read from scratch[0] never changes for the threads once it's been read, and it loops forever. I know it's being incremented because it's the correct value when I read it back to the host.

Is the global memory being locally cached? What's going on here?

A: 

Found the problem: the order in which work groups are executed is implementation-defined. This means that some work groups might start only after others have finished.

In the code I gave, the work groups that are started first will loop forever waiting on the others to hit the 'barrier'. The work groups that would be started later never start, because they're waiting for the first ones to finish.

If the implementation (I'm on a Radeon 5750, using Stream SDK 2.2) executes all work groups concurrently, then it probably wouldn't be an issue. But that's not the case for my setup.
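To make the deadlock concrete, here is a rough host-side C model of the situation described above. It is only a sketch, not OpenCL code: the function name simulate_spin_barrier, the result struct, and the slots parameter (how many work groups the device can keep running at once) are all made up for illustration, and real schedulers are more complicated than this.

```c
/* Hypothetical model of the spin 'barrier'; none of these names are OpenCL APIs. */
typedef struct {
    unsigned counter;   /* final value of scratch[0] in the model */
    unsigned finished;  /* number of work groups that got past the barrier */
} barrier_result;

/* Assumption: at most `slots` work groups run at once, and a pending
   work group can start only after a running one has finished. */
barrier_result simulate_spin_barrier(unsigned groups, unsigned group_size,
                                     unsigned slots)
{
    unsigned n_threads = groups * group_size;  /* get_global_size(0) */
    unsigned counter = 0, started = 0, finished = 0;
    barrier_result result;

    for (;;) {
        int progressed = 0;

        /* Start pending work groups while a slot is free. Each thread
           in a newly started group does its atom_inc once, then spins. */
        while (started < groups && started - finished < slots) {
            counter += group_size;
            started++;
            progressed = 1;
        }

        /* Spinning groups can leave only once every thread has incremented. */
        if (counter >= n_threads && finished < started) {
            finished = started;
            progressed = 1;
        }

        /* Nothing can start and nothing can finish: done or deadlocked. */
        if (!progressed)
            break;
    }

    result.counter = counter;
    result.finished = finished;
    return result;
}
```

With 4 work groups of 64 threads and only 2 slots, the model stalls with counter at 128 out of 256 and no group finished, which matches the endless loop in the question. With 4 or more slots every group starts, the counter reaches 256, and all 4 groups get through.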

stevehb