ansaurus

Question

Answer 1

+3 A:

You're not properly synchronizing the summation into the blockDim.x location. None of the threads waits to see what the others have written before adding its own value. It goes something like this:

Everyone reads zero,
goes home, calculates zero + numer.
Everyone writes zero + numer to the memory location

The thread with the highest threadId wins, I suppose because it is the most likely to write last.

What you want to do instead, to get a quick sum, is a binary sum over s_shared[threadIdx.x]:

everyone writes their numer
half the threads calculate sums of pairs and write those to a new location
a quarter of the threads calculate the sums of pairs of pairs, and write those to a new location
etc
until you just have one thread and one sum

This takes O(n) work and O(log n) time.
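
The steps above can be sketched roughly as follows. This is only a minimal illustration, not the asker's code: the names blockSum, sumWithReduction and BLOCK_SIZE are made up for the example, the block size is assumed to be a power of two, and the sums are written in place into s_shared rather than to a new location (which is safe here because a thread only overwrites a slot that no other thread reads in the same round). Error checking is left out, and n is assumed to be greater than zero.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical block size; must be a power of two for the halving loop below.
constexpr int BLOCK_SIZE = 256;

// Each block sums up to BLOCK_SIZE elements of `in` and writes one partial sum.
__global__ void blockSum(const float* in, float* out, int n)
{
    __shared__ float s_shared[BLOCK_SIZE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Everyone writes their own value (0 for threads past the end of the input).
    s_shared[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // all values must be written before anyone starts adding

    // Half the threads add pairs, then a quarter, and so on down to one thread.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            s_shared[threadIdx.x] += s_shared[threadIdx.x + stride];
        }
        __syncthreads();  // wait for this round's writes before the next round
    }

    // Thread 0 now holds the sum for the whole block.
    if (threadIdx.x == 0) {
        out[blockIdx.x] = s_shared[0];
    }
}

// Host-side helper (assumed for illustration): launches the kernel and adds
// the per-block partial sums on the CPU. Assumes n > 0.
float sumWithReduction(const float* host, int n)
{
    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, host, n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, BLOCK_SIZE>>>(d_in, d_out, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_out, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);  // also waits for the kernel to finish
    cudaFree(d_in);
    cudaFree(d_out);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

The important part is the __syncthreads() after the initial write and after every round: without it you are back to threads reading values that other threads have not finished writing.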

rampion 2009-07-01 21:23:13

Just to note: the logic here is known as a reduction. There are a few samples of this in the CUDA SDK. See: cuda-sdk/C/src/reduction/reduction_kernel.cu

sharth 2010-03-05 19:08:23

CUDA shared memory array - odd behavior
