I'm writing some CUDA code for neural network activation, and I'm running into an issue: I'm not getting the correct sum of the weights coming into a given neuron.

So here is the kernel code, and I'll try to explain the variables a bit more clearly below it.

__global__ void kernelSumWeights(float* sumArray, float* weightArray, int2* sourceTargetArray, int cLength)
{
    // Flatten thread and block coordinates into a single connection index.
    int nx = threadIdx.x + TILE_WIDTH*threadIdx.y;
    int index_in = (blockIdx.x + gridDim.x*blockIdx.y)*TILE_WIDTH*TILE_WIDTH + nx;

    if (index_in < cLength)
    {
        // Accumulate |weight| into the slot for this connection's target neuron.
        sumArray[sourceTargetArray[index_in].y] += fabs(weightArray[index_in]);
        //__threadfence();
        __threadfence_block();
    }
}

First off, cLength is the number of connections in the network. Every connection has a source neuron, a target neuron, and a weight. sourceTargetArray holds that information: at index i, the .x component is the source neuron index of connection i, and the .y component is the target neuron index. weightArray holds the weights (index i of weightArray corresponds to connection i).

As you can see, sumArray is where I'm storing the sums. The kernel increments sumArray (at the target neuron index of connection i) by the absolute value of the weight of connection i. Intuitively: for each neuron, sum the weights of all its incoming connections. That's really all I'm trying to do with this kernel. Eventually, I'll normalize the weights using this sum.

The problem is that it's wrong. I've done this serially, and the answer is different. The answers usually differ by about 12-15x (so where the right answer would be 700.0, I'm getting something in the 50s).
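For reference, the serial version I'm comparing against is essentially this (a sketch; `nNeurons` is just an illustrative name):

// Serial reference: sum |weight| over all incoming connections per neuron.
for (int n = 0; n < nNeurons; ++n)
    sumArray[n] = 0.0f;
for (int i = 0; i < cLength; ++i)
    sumArray[sourceTargetArray[i].y] += fabsf(weightArray[i]);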

You can see that I added __threadfence() (and __threadfence_block()) in an attempt to make sure that the writes weren't being done at the same time by every thread. I'm not sure if this is the problem with my code. I've ensured that the weight array is identical to the serial version I tested, and that the source/target information is identical as well. What am I doing wrong?

EDIT: For reference, __threadfence() usage is described in the CUDA Programming Guide v3.1, Appendix B.5, "Memory Fence Functions".

+3  A: 

`+=` is not atomic, so it is not thread safe. Use `atomicAdd`.
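A minimal sketch of the fix, keeping the original indexing (float `atomicAdd` assumes a device of compute capability 2.0 or later, as noted in the comments below):

__global__ void kernelSumWeights(float* sumArray, float* weightArray, int2* sourceTargetArray, int cLength)
{
    int nx = threadIdx.x + TILE_WIDTH*threadIdx.y;
    int index_in = (blockIdx.x + gridDim.x*blockIdx.y)*TILE_WIDTH*TILE_WIDTH + nx;
    if (index_in < cLength)
    {
        // atomicAdd serializes concurrent updates to the same address,
        // so no increments are lost; no memory fence is needed.
        atomicAdd(&sumArray[sourceTargetArray[index_in].y], fabsf(weightArray[index_in]));
    }
}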

Also, you should avoid writing to the same memory cell. The problem is that these calls will be serialized: threads will stand in line and wait for each other. If you can't avoid this operation, try to break your algorithm into two phases: individual computation and merging. Parallel merging can be implemented very efficiently.

Andrey
I'm not sure I understand. atomicAdd is for integers; I'm using floats. Additionally, when you say "individual computation and merging", what is the individual computation in my scenario? The summation? I'm not sure how I could avoid writing to the same cell.
Paul
@Paul Open B.11.1.1 of the NVIDIA CUDA C Programming Guide Version 3.1 (5/28/2010); there is a `float` version of atomicAdd. OK, in your case you don't have individual computations. The code you wrote is not efficient; read more about how to sum efficiently here: http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
Andrey
atomicAdd supports floats, but only on later hardware; prior to compute capability 2.0, only integer atomicAdd was supported.
Ade Miller
A: 

You need to do a reduction.

Sum the elements assigned to each thread and place the result in an array, `cache[threadsPerBlock]`, then call `__syncthreads()`.

Now reduce the resulting subtotals by adding pairs of them together, halving the stride on each pass:

int cacheIndex = threadIdx.x;
int i = blockDim.x / 2;
while (i != 0)
{
    // Each active thread folds the upper half of the cache onto the lower half.
    if (cacheIndex < i)
        cache[cacheIndex] += cache[cacheIndex + i];

    // Every thread must finish this pass before the next one begins.
    __syncthreads();
    i /= 2;
}
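For context, the fragment above assumes the per-thread sums were already written into a shared-memory array. A minimal kernel sketch around it (the name `threadsPerBlock` and the grid-stride first phase are illustrative, following the "CUDA by Example" dot-product pattern):

#define threadsPerBlock 256  // illustrative; must match the launch configuration

__global__ void sumKernel(const float* in, float* blockSums, int n)
{
    __shared__ float cache[threadsPerBlock];

    // Phase 1: each thread accumulates a private partial sum over a strided range.
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    float temp = 0.0f;
    for (int j = tid; j < n; j += blockDim.x*gridDim.x)
        temp += in[j];

    int cacheIndex = threadIdx.x;
    cache[cacheIndex] = temp;
    __syncthreads();

    // Phase 2: the tree reduction shown above.
    int i = blockDim.x / 2;
    while (i != 0)
    {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
        i /= 2;
    }

    // Thread 0 writes this block's subtotal; the host (or a second kernel)
    // sums the per-block results.
    if (cacheIndex == 0)
        blockSums[blockIdx.x] = cache[0];
}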

The following deck explains this in some detail:

http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf

Sample code for this is here:

http://www.nvidia.com/object/cuda_sample_data-parallel.html

It's also very well explained in "CUDA by Example" (which is where the code fragment comes from).

There is one big caveat with this approach: the additions will not occur in the same order they would in serial code. Floating-point addition is not associative, so rounding errors may lead to slightly different results.

Ade Miller