tags:
views: 166
answers: 2

Hi all, I am new to CUDA. I have a question about a simple program; I hope someone can spot my mistake.

__global__ void ADD(float* A, float* B, float* C)
{
   const int ix = blockDim.x * blockIdx.x + threadIdx.x;
   const int iy = blockDim.y * blockIdx.y + threadIdx.y;

   if(ix < 16 && iy < 16)
   {
      for(int i = 0; i<256; i++)
      C[i] = A[ix+iy*16] + B[ix+iy*16] + C[i]; // << I wish to store all in C
   }
}

extern "C" void cuda_p(float* A, float* B, float* C)
{
    float* dev_A;
    float* dev_B;
    float* dev_C;
    cudaMalloc((void**) &dev_A,  sizeof(float) * 256);
    cudaMalloc((void**) &dev_B,  sizeof(float) * 256);
    cudaMalloc((void**) &dev_C,  sizeof(float) * 256);
    cudaMemcpy(dev_A, A, sizeof(float) * 256, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_B, B, sizeof(float) * 256, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_C, C, sizeof(float) * 256, cudaMemcpyHostToDevice);
    ADDD<<<16,16>>>(dev_A,dev_B,dev_C);
    cudaMemcpy(A, dev_A, sizeof(float) * 256, cudaMemcpyDeviceToHost);
    cudaMemcpy(B, dev_B, sizeof(float) * 256, cudaMemcpyDeviceToHost);
    cudaMemcpy(C, dev_C, sizeof(float) * 256, cudaMemcpyDeviceToHost);
    cudaFree(dev_A);
    cudaFree(dev_B);
    cudaFree(dev_C);
}
+1  A: 
  1. Are you sure about the kernel launch configuration? In your code you try to launch some unknown function ADDD. And your execution configuration is gridDim = (16, 1, 1) and blockDim = (16, 1, 1). So in your kernel blockIdx.x = [0..16) and threadIdx.x = [0..16). If I understood you right, then

    ix = threadIdx.x; iy = blockIdx.x;

    Read about it in CUDA Programming Guide (Appendix B.15).

  2. But that's not the only mistake. When you accumulate values in C[i] you have a race condition: 16 threads (half a warp) simultaneously read C[i], add some value (A[ix+iy*16] + B[ix+iy*16]), and write the result back to C[i]. You should use atomic add operations (CUDA Programming Guide, Appendix B.11.1.1) or redesign your kernel to maximize memory coalescing (CUDA C Best Practices Guide 3.2.1), because atomics are very, very slow... (see the sketch below).
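
For illustration, a minimal sketch of the atomic-add variant, assuming the index mapping from point 1 and a device that supports floating-point atomicAdd (compute capability 2.0 or later); it only shows where the atomic goes, not how to make it fast:

__global__ void ADD(float* A, float* B, float* C)
{
   // With a <<<16, 16>>> launch only the .x components vary:
   const int ix = threadIdx.x;   // 0..15 within the block
   const int iy = blockIdx.x;    // 0..15 across the grid

   if (ix < 16 && iy < 16)
   {
      const float val = A[ix + iy*16] + B[ix + iy*16];
      for (int i = 0; i < 256; i++)
         atomicAdd(&C[i], val);  // serializes the conflicting updates to C[i]
   }
}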

KoppeKTop
I would like to use atomic add, but my graphics card GTS8800 does not support it. I had searched for some examples and tried them on my desktop.
kitw
It doesn't support atomics because of compute capability 1.0 (see Table A-1 in the Programming Guide, plus the overview in Appendix G). Maybe you need to look at the algorithm from a different angle? 1. Sum all elements in A (e.g. http://bit.ly/bKMjtP) -> `Sum_A`; 2. Sum all elements in B -> `Sum_B`; 3. `Sum_AB = Sum_A + Sum_B`; 4. Map the function `C[i] += Sum_AB` onto each element in C. A rough sketch of that idea is below.
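
As a sketch only: the reduction here is a naive single-block version just to show the structure (the linked example is much faster), and dev_sumA/dev_sumB are assumed to be one-float device allocations made by the host:

// Naive single-block reduction: 256 threads sum the 256 input floats.
__global__ void reduce256(const float* in, float* out)
{
    __shared__ float s[256];
    const int t = threadIdx.x;
    s[t] = in[t];
    __syncthreads();
    for (int stride = 128; stride > 0; stride >>= 1)
    {
        if (t < stride)
            s[t] += s[t + stride];
        __syncthreads();
    }
    if (t == 0)
        *out = s[0];
}

// Map step: add the two precomputed sums to every element of C.
__global__ void addSums(float* C, const float* sumA, const float* sumB)
{
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < 256)
        C[i] += *sumA + *sumB;
}

// Hypothetical usage:
// reduce256<<<1, 256>>>(dev_A, dev_sumA);
// reduce256<<<1, 256>>>(dev_B, dev_sumB);
// addSums<<<16, 16>>>(dev_C, dev_sumA, dev_sumB);
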
KoppeKTop
Thanks for the reply. I am working on ray-triangle intersection. First I finished a CUDA kernel where each thread uses one ray to intersect all the triangles, but it is slow to load the triangles all the time. Therefore, I am trying to use each triangle to intersect all the rays; the problem I have is that I can't build an array that is shared among all threads, like C here.
kitw
In that case you could use a technique from n-body simulations or simply read this: http://bit.ly/9TDByC
KoppeKTop
+1  A: 

Your primary issue is that the core of your kernel doesn't make sense. What you have is:

for(int i = 0; i<256; i++)
      C[i] = A[ix+iy*16] + B[ix+iy*16] + C[i]; // << I wish to store all in C

This is going to have each thread go through and read every entry in C, add its own part of A and B to it, and write it back. Since each thread is doing this at the same time, they're going to step on each other. If you really want every entry in C to be the sum of all entries in A and all entries in B, you want to make each thread responsible for a certain entry in C:

for(int i = 0; i<256; i++)
      C[ix+iy*16] += A[i] + B[i];

If instead you want every entry in C to be the sum of the corresponding entries in A and B, which seems more likely, then you would get rid of the loop, and your kernel would look like:

__global__ void ADD(float* A, float* B, float* C)
{
   const int ix = blockDim.x * blockIdx.x + threadIdx.x;
   const int iy = blockDim.y * blockIdx.y + threadIdx.y;

   if(ix < 16 && iy < 16)
   {
      C[ix+iy*16] = A[ix+iy*16] + B[ix+iy*16];
   }
}

Each thread grabs one entry from A and one from B, and writes one entry in C.

Your secondary issue is that you're launching the kernel wrong. You're doing:

ADDD<<<16,16>>>(dev_A,dev_B,dev_C);

This launches a 16x1 grid of blocks with 16x1 threads each (and of the typo'd kernel name ADDD). If you want to have your threads positioned in 2 dimensions (using both the x and y indexes), you need to use dim3 as your size specifier type. Something like:

// Use a grid of 4x4 blocks
dim3 gridSize;
gridSize.x = 4;
gridSize.y = 4;

// Use blocks of 4x4 threads.
dim3 blockSize;
blockSize.x = 4;
blockSize.y = 4;

// Run a 4x4 grid of blocks, each with 4x4 threads.
// So you end up with a 16x16 group of threads, matching your data layout.
ADD<<<gridSize,blockSize>>>(dev_A,dev_B,dev_C);
interfect
Thanks for the reply. It's my fault that I didn't ask correctly. What I am looking for is how to manage the shared memory for C, as it will collapse when multiple threads read the same location.
kitw
You can have simultaneous reads from the same location just fine; it's simultaneous *writes* or *updates* that are the problem. Also, what do you mean by "manage the shared memory"? The term "shared memory" has a certain specific meaning in CUDA, and you aren't using it.
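
For example, here is a minimal sketch (the kernel and names are made up) of CUDA __shared__ memory where every thread reads the same location; the concurrent reads are perfectly safe, and it is concurrent writes that would need synchronization or atomics:

__global__ void broadcastExample(const float* in, float* out, int n)
{
    __shared__ float common;       // one value shared by the whole block
    if (threadIdx.x == 0)
        common = in[0];            // a single thread writes it...
    __syncthreads();               // ...and everyone waits for that write

    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] + common;   // all threads read 'common' at once: fine
}
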
interfect