I'm writing some CUDA code for neural network activation, and I'm running into an issue: I'm not getting the correct sum of the weights coming into a given neuron.

So here is the kernel code, and I'll try to explain the variables a bit more clearly below it.

__global__ void kernelSumWeights(float* sumArray, float* weightArray, int2* sourceTargetArray, int cLength)
{
    // Flatten thread and block coordinates into a single connection index.
    int nx = threadIdx.x + TILE_WIDTH*threadIdx.y;
    int index_in = (blockIdx.x + gridDim.x*blockIdx.y)*TILE_WIDTH*TILE_WIDTH + nx;

    if (index_in < cLength)
    {
        // Accumulate |weight| into the slot for this connection's target neuron.
        sumArray[sourceTargetArray[index_in].y] += fabs(weightArray[index_in]);
        //__threadfence();
        __threadfence_block();
    }
}

First off, cLength is the number of connections in the network. Every connection has a source neuron, a target neuron, and a weight. sourceTargetArray holds that information: at index i, the .x component is the source neuron index of connection i, and the .y component is the target neuron index. weightArray holds the weights (index i of weightArray corresponds to connection i).

As you can see, sumArray is where I'm storing the sums. The kernel increments sumArray (at the target neuron index of connection i) by the absolute value of the weight of connection i. Intuitively: for each neuron, sum the weights of all its incoming connections. That's really all I'm trying to do with this kernel. Eventually, I'll normalize the weights using this sum.

The problem is that it's wrong. I've done this serially, and the answer is different. The answers usually differ by about 12-15x (so where the right answer would be 700.0, I'm getting something in the 50s).
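For reference, the serial version I'm comparing against is essentially this (a sketch; `nNeurons` is just an illustrative name):

// Serial reference: sum |weight| over all incoming connections per neuron.
for (int n = 0; n < nNeurons; ++n)
    sumArray[n] = 0.0f;
for (int i = 0; i < cLength; ++i)
    sumArray[sourceTargetArray[i].y] += fabsf(weightArray[i]);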

You can see that I added __threadfence() (and __threadfence_block()) in an attempt to make sure that the writes weren't being done at the same time by every thread. I'm not sure if this is the problem with my code. I've ensured that the weight array is identical to the serial version I tested, and that the source/target information is identical as well. What am I doing wrong?

EDIT: For reference, __threadfence() usage is described in the CUDA Programming Guide v3.1, Appendix B.5, "Memory Fence Functions".

+3  A: 

`+=` is not atomic, so it is not thread safe. Use `atomicAdd`.
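A minimal sketch of the fix, keeping the original indexing (float `atomicAdd` assumes a device of compute capability 2.0 or later, as noted in the comments below):

__global__ void kernelSumWeights(float* sumArray, float* weightArray, int2* sourceTargetArray, int cLength)
{
    int nx = threadIdx.x + TILE_WIDTH*threadIdx.y;
    int index_in = (blockIdx.x + gridDim.x*blockIdx.y)*TILE_WIDTH*TILE_WIDTH + nx;
    if (index_in < cLength)
    {
        // atomicAdd serializes concurrent updates to the same address,
        // so no increments are lost; no memory fence is needed.
        atomicAdd(&sumArray[sourceTargetArray[index_in].y], fabsf(weightArray[index_in]));
    }
}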

Also, you should avoid writing to the same memory cell. The problem is that these calls will be serialized: threads will stand in line and wait for each other. If you can't avoid this operation, try to break your algorithm into two phases: individual computation and merging. Parallel merging can be implemented very efficiently.

Andrey
I'm not sure I understand. atomicAdd is for integers; I'm using floats. Additionally, when you say "individual computation and merging", what is the individual computation in my scenario? The summation? I'm not sure how I could avoid writing to the same cell.
Paul
@Paul Open B.11.1.1 of the NVIDIA CUDA C Programming Guide Version 3.1 (5/28/2010); there is a `float` version of atomicAdd. OK, in your case you don't have individual computations. The code you wrote is not efficient; read more about how to sum efficiently here: http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
Andrey
atomicAdd supports floats, but only on later hardware; prior to compute capability 2.0, only integer atomicAdd was supported.
Ade Miller
A: 

You need to do a reduction.

Sum the elements assigned to each thread and place the result in an array, `cache[threadsPerBlock]`, then call `__syncthreads()`.

Now reduce the resulting subtotals by adding pairs of them together, halving the stride on each pass:

int cacheIndex = threadIdx.x;
int i = blockDim.x / 2;
while (i != 0)
{
    // Each active thread folds the upper half of the cache onto the lower half.
    if (cacheIndex < i)
        cache[cacheIndex] += cache[cacheIndex + i];

    // Every thread must finish this pass before the next one begins.
    __syncthreads();
    i /= 2;
}
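For context, the fragment above assumes the per-thread sums were already written into a shared-memory array. A minimal kernel sketch around it (the name `threadsPerBlock` and the grid-stride first phase are illustrative, following the "CUDA by Example" dot-product pattern):

#define threadsPerBlock 256  // illustrative; must match the launch configuration

__global__ void sumKernel(const float* in, float* blockSums, int n)
{
    __shared__ float cache[threadsPerBlock];

    // Phase 1: each thread accumulates a private partial sum over a strided range.
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    float temp = 0.0f;
    for (int j = tid; j < n; j += blockDim.x*gridDim.x)
        temp += in[j];

    int cacheIndex = threadIdx.x;
    cache[cacheIndex] = temp;
    __syncthreads();

    // Phase 2: the tree reduction shown above.
    int i = blockDim.x / 2;
    while (i != 0)
    {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
        i /= 2;
    }

    // Thread 0 writes this block's subtotal; the host (or a second kernel)
    // sums the per-block results.
    if (cacheIndex == 0)
        blockSums[blockIdx.x] = cache[0];
}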

The following deck explains this in some detail:

http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf

Sample code for this is here:

http://www.nvidia.com/object/cuda_sample_data-parallel.html

It's also very well explained in "CUDA by Example" (which is where the code fragment comes from).

There is one big caveat with this approach: the additions will not occur in the same order they would in serial code. Floating-point addition is not associative, so rounding errors may lead to slightly different results.

Ade Miller