tags:
views: 166
answers: 2

Hi all, I am new to CUDA. I have a question about a simple program; I hope someone can spot my mistake.

__global__ void ADD(float* A, float* B, float* C)
{
   const int ix = blockDim.x * blockIdx.x + threadIdx.x;
   const int iy = blockDim.y * blockIdx.y + threadIdx.y;

   if(ix < 16 && iy < 16)
   {
      for(int i = 0; i<256; i++)
      C[i] = A[ix+iy*16] + B[ix+iy*16] + C[i]; // << I wish to store all in C
   }
}

extern "C" void cuda_p(float* A, float* B, float* C)
{
    float* dev_A;
    float* dev_B;
    float* dev_C;
    cudaMalloc((void**) &dev_A,  sizeof(float) * 256);
    cudaMalloc((void**) &dev_B,  sizeof(float) * 256);
    cudaMalloc((void**) &dev_C,  sizeof(float) * 256);
    cudaMemcpy(dev_A, A, sizeof(float) * 256, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_B, B, sizeof(float) * 256, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_C, C, sizeof(float) * 256, cudaMemcpyHostToDevice);
    ADDD<<<16,16>>>(dev_A,dev_B,dev_C);
    cudaMemcpy(A, dev_A, sizeof(float) * 256, cudaMemcpyDeviceToHost);
    cudaMemcpy(B, dev_B, sizeof(float) * 256, cudaMemcpyDeviceToHost);
    cudaMemcpy(C, dev_C, sizeof(float) * 256, cudaMemcpyDeviceToHost);
    cudaFree(dev_A);
    cudaFree(dev_B);
    cudaFree(dev_C);
}
+1  A: 
  1. Are you sure about the kernel launch configuration? In your code you try to launch some unknown function ADDD. And your execution configuration is gridDim = (16, 1, 1) and blockDim = (16, 1, 1). So in your kernel blockIdx.x = [0..16) and threadIdx.x = [0..16). If I understood you right, then

    ix = threadIdx.x; iy = blockIdx.x;

    Read about it in CUDA Programming Guide (Appendix B.15).

  2. But that's not the only mistake. When you accumulate values in C[i] you have a race condition: 16 threads (half a warp) simultaneously read C[i], add some value (A[ix+iy*16] + B[ix+iy*16]), and write the result back to C[i]. You should use atomic add operations (CUDA Programming Guide, Appendix B.11.1.1) or redesign your kernel to maximize memory coalescing (CUDA C Best Practices Guide 3.2.1), because atomics are very, very slow... (see the sketch below).
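
For illustration, a minimal sketch of the atomic-add variant, assuming the index mapping from point 1 and a device that supports floating-point atomicAdd (compute capability 2.0 or later); it only shows where the atomic goes, not how to make it fast:

__global__ void ADD(float* A, float* B, float* C)
{
   // With a <<<16, 16>>> launch only the .x components vary:
   const int ix = threadIdx.x;   // 0..15 within the block
   const int iy = blockIdx.x;    // 0..15 across the grid

   if (ix < 16 && iy < 16)
   {
      const float val = A[ix + iy*16] + B[ix + iy*16];
      for (int i = 0; i < 256; i++)
         atomicAdd(&C[i], val);  // serializes the conflicting updates to C[i]
   }
}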

KoppeKTop
I would like to use atomic add, but my graphics card GTS8800 does not support it. I had searched for some examples and tried them on my desktop.
kitw
It doesn't support atomics because of compute capability 1.0 (see Table A-1 in the Programming Guide, plus the overview in Appendix G). Maybe you need to look at the algorithm from a different angle? 1. Sum all elements in A (e.g. http://bit.ly/bKMjtP) -> `Sum_A`; 2. Sum all elements in B -> `Sum_B`; 3. `Sum_AB = Sum_A + Sum_B`; 4. Map the function `C[i] += Sum_AB` onto each element in C. A rough sketch of that idea is below.
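
As a sketch only: the reduction here is a naive single-block version just to show the structure (the linked example is much faster), and dev_sumA/dev_sumB are assumed to be one-float device allocations made by the host:

// Naive single-block reduction: 256 threads sum the 256 input floats.
__global__ void reduce256(const float* in, float* out)
{
    __shared__ float s[256];
    const int t = threadIdx.x;
    s[t] = in[t];
    __syncthreads();
    for (int stride = 128; stride > 0; stride >>= 1)
    {
        if (t < stride)
            s[t] += s[t + stride];
        __syncthreads();
    }
    if (t == 0)
        *out = s[0];
}

// Map step: add the two precomputed sums to every element of C.
__global__ void addSums(float* C, const float* sumA, const float* sumB)
{
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < 256)
        C[i] += *sumA + *sumB;
}

// Hypothetical usage:
// reduce256<<<1, 256>>>(dev_A, dev_sumA);
// reduce256<<<1, 256>>>(dev_B, dev_sumB);
// addSums<<<16, 16>>>(dev_C, dev_sumA, dev_sumB);
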
KoppeKTop
Thanks for the reply. I am working on ray-triangle intersection. First I finished a CUDA kernel where each thread uses one ray to intersect all the triangles, but it is slow to load the triangles all the time. Therefore, I am trying to use each triangle to intersect all the rays; the problem I have is that I can't build an array that is shared among all threads, like C here.
kitw
In that case you could use a technique from n-body simulations or simply read this: http://bit.ly/9TDByC
KoppeKTop
+1  A: 

Your primary issue is that the core of your kernel doesn't make sense. What you have is:

for(int i = 0; i<256; i++)
      C[i] = A[ix+iy*16] + B[ix+iy*16] + C[i]; // << I wish to store all in C

This is going to have each thread go through and read every entry in C, add its own part of A and B to it, and write it back. Since each thread is doing this at the same time, they're going to step on each other. If you really want every entry in C to be the sum of all entries in A and all entries in B, you want to make each thread responsible for a certain entry in C:

for(int i = 0; i<256; i++)
      C[ix+iy*16] += A[i] + B[i];

If instead you want every entry in C to be the sum of the corresponding entries in A and B, which seems more likely, then you would get rid of the loop, and your kernel would look like:

__global__ void ADD(float* A, float* B, float* C)
{
   const int ix = blockDim.x * blockIdx.x + threadIdx.x;
   const int iy = blockDim.y * blockIdx.y + threadIdx.y;

   if(ix < 16 && iy < 16)
   {
      C[ix+iy*16] = A[ix+iy*16] + B[ix+iy*16];
   }
}

Each thread grabs one entry from A and one from B, and writes one entry in C.

Your secondary issue is that you're launching the kernel wrong. You're doing:

ADDD<<<16,16>>>(dev_A,dev_B,dev_C);

This launches a 16x1 grid of blocks with 16x1 threads each (and of the typo'd kernel name ADDD). If you want to have your threads positioned in 2 dimensions (using both the x and y indexes), you need to use dim3 as your size specifier type. Something like:

// Use a grid of 4x4 blocks
dim3 gridSize;
gridSize.x = 4;
gridSize.y = 4;

// Use blocks of 4x4 threads.
dim3 blockSize;
blockSize.x = 4;
blockSize.y = 4;

// Run a 4x4 grid of blocks, each with 4x4 threads.
// So you end up with a 16x16 group of threads, matching your data layout.
ADD<<<gridSize,blockSize>>>(dev_A,dev_B,dev_C);
interfect
Thanks for the reply. It's my fault that I didn't ask correctly. What I am looking for is how to manage the shared memory for C, as it will collapse when multiple threads read the same location.
kitw
You can have simultaneous reads from the same location just fine; it's simultaneous *writes* or *updates* that are the problem. Also, what do you mean by "manage the shared memory"? The term "shared memory" has a certain specific meaning in CUDA, and you aren't using it.
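
For example, here is a minimal sketch (the kernel and names are made up) of CUDA __shared__ memory where every thread reads the same location; the concurrent reads are perfectly safe, and it is concurrent writes that would need synchronization or atomics:

__global__ void broadcastExample(const float* in, float* out, int n)
{
    __shared__ float common;       // one value shared by the whole block
    if (threadIdx.x == 0)
        common = in[0];            // a single thread writes it...
    __syncthreads();               // ...and everyone waits for that write

    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] + common;   // all threads read 'common' at once: fine
}
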
interfect