In a CUDA kernel, I have code similar to the following. I am trying to calculate one numerator per thread, accumulate the numerators over the block to get a denominator, and then return the ratio. However, denom ends up equal to the numer value calculated by the thread with the largest threadIdx.x in the block, rather than the sum of the numer values calculated across all the threads in the block. Does anyone know what is going on?
extern __shared__ float s_shared[];
float numer = /* calculate numerator */;
s_shared[threadIdx.x] = numer;
s_shared[blockDim.x] += numer;
__syncthreads();
float denom = s_shared[blockDim.x];
float result = numer/denom;
"result" should always be between 0 and 1 and should sum to 1 across the block. Instead, it is equal to 1.0 for every thread whose threadIdx.x is the maximum, and takes some other value, not confined to that range, for the other threads in the block.
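For reference, here is a minimal sketch of the behaviour I am after, offered as one possible approach rather than a definitive fix. The line `s_shared[blockDim.x] += numer;` is a non-atomic read-modify-write performed by every thread on the same shared-memory location, so concurrent updates can overwrite one another, and that slot is never initialized before the accumulation. The sketch replaces it with a shared-memory tree reduction. The kernel name `normalizeBlock`, the `in`/`out` arrays, and the host driver are hypothetical and only there for illustration; the sketch also assumes blockDim.x is a power of two and that the launch provides blockDim.x floats of dynamic shared memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: assumes blockDim.x is a power of two and that the
// launch passes blockDim.x * sizeof(float) bytes of dynamic shared memory.
__global__ void normalizeBlock(const float* in, float* out)
{
    extern __shared__ float s_shared[];

    unsigned int tid = threadIdx.x;
    unsigned int gid = blockIdx.x * blockDim.x + tid;

    float numer = in[gid];          // stand-in for the real numerator calculation
    s_shared[tid] = numer;
    __syncthreads();                // all numerators are visible before summing

    // Tree reduction: each step halves the number of active threads.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            s_shared[tid] += s_shared[tid + stride];
        }
        __syncthreads();            // kept outside the if so every thread reaches it
    }

    float denom = s_shared[0];      // block-wide sum of the numerators
    out[gid] = numer / denom;
}

int main()
{
    const int n = 256;              // a single block of 256 threads
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = float(i + 1);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    normalizeBlock<<<1, n, n * sizeof(float)>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    float total = 0.0f;
    for (int i = 0; i < n; ++i) total += h_out[i];
    printf("sum of ratios = %f\n", total);  // should be close to 1, up to rounding

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

A shorter alternative, under the same caveats, would be to have one thread zero a shared accumulator, call `__syncthreads()`, and then have every thread call `atomicAdd` on it; the reduction above avoids serializing all threads on a single address.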