views: 141

answers: 2

I am using CUDA SDK 3.1 on MS VS2005 with a GTX465 1 GB GPU. I have the following kernel function:

__global__ void CRT_GPU_2(float *A, float *X, float *Y, float *Z, float *pIntensity, float *firstTime, float *pointsNumber)
{
  int holo_x = blockIdx.x*20 + threadIdx.x;
  int holo_y = blockIdx.y*20 + threadIdx.y;

  float k = 2.0f*3.14f/0.000000054f;

  if (firstTime[0] == 1.0f)
  {
    pIntensity[holo_x+holo_y*MAX_FINAL_X] = 0.0f;
  }

  for (int i = 0; i < pointsNumber[0]; i++)
  {
    pIntensity[holo_x+holo_y*MAX_FINAL_X] = pIntensity[holo_x+holo_y*MAX_FINAL_X] + A[i]*cosf(k*sqrtf(pow(holo_x-X[i],2.0f)+pow(holo_y-Y[i],2.0f)+pow(Z[i],2.0f)));
  }

  __syncthreads();
}

and this is the function that calls the kernel:

extern "C" void go2(float *pDATA, float *X, float *Y, float *Z, float *pIntensity, float *firstTime, float *pointsNumber)
{
 dim3 blockGridRows(MAX_FINAL_X/20,MAX_FINAL_Y/20);
 dim3 threadBlockRows(20, 20);

 CRT_GPU_2<<<blockGridRows, threadBlockRows>>>(pDATA, X, Y, Z, pIntensity,firstTime, pointsNumber); 
 CUT_CHECK_ERROR("multiplyNumbersGPU() execution failed\n");
 CUDA_SAFE_CALL( cudaThreadSynchronize() );
}

I am loading all the parameters to this function in a loop (for example, 4096 elements per parameter in each loop iteration). In total I want to run this kernel for 32768 elements per parameter across all loop iterations.

The MAX_FINAL_X is 1920 and MAX_FINAL_Y is 1080.

When I start the algorithm, the first iteration goes very fast, but after one or two more iterations I get a CUDA timeout error. I used this algorithm on a GTX260 GPU and, as far as I remember, it behaved better...

Could you help me? Maybe I am making some mistake related to the new Fermi architecture in this algorithm?

+1  A: 

Is your GPU connected to a display? If so, I believe the default is that kernel execution will be aborted after 5 seconds. You can check whether kernel execution will time out by using cudaGetDeviceProperties - see the reference page.
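A minimal sketch of that check (assuming device 0; the relevant field of `cudaDeviceProp` is `kernelExecTimeoutEnabled`):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
  cudaDeviceProp prop;
  cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // query device 0
  if (err != cudaSuccess)
  {
    fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  // Non-zero means the OS watchdog will kill kernels that run too long
  // (typically ~5 seconds on a display-attached GPU).
  printf("kernelExecTimeoutEnabled = %d\n", prop.kernelExecTimeoutEnabled);
  return 0;
}
```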

Edric
yes it is connected to my display...
Tome
A: 
  1. It would be better to call CUT_CHECK_ERROR after cudaThreadSynchronize(). Kernels run asynchronously, so you must wait for the kernel to finish to learn about errors... Maybe in the second iteration you are receiving an error from the first kernel launch.
  2. Make sure you have a valid number in the most interesting variable, pointsNumber[0] (it might cause a long internal loop).
  3. You could also improve the speed of your kernel function:
    • Use better block dimensions. A 20x20 thread configuration will cause very slow memory access (see the Programming Guide and Best Practices). Try 16x16 blocks.
    • Do not use the pow(..., 2.0) function. It's faster to use an SQR macro (#define SQR(x) ((x)*(x))).
    • You don't use shared memory, so __syncthreads() is not required.
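Taken together, the kernel might be reworked along these lines (an untested sketch; the 16x16 blocks and SQR macro are the suggestions above, and `firstTime`/`pointsNumber` are passed by value rather than through pointers):

```cuda
#define SQR(x) ((x)*(x))

__global__ void CRT_GPU_2(float *A, float *X, float *Y, float *Z,
                          float *pIntensity, bool firstTime, int pointsNumber)
{
  // blockDim replaces the hard-coded 20, so the launch configuration
  // can change without editing the kernel
  int holo_x = blockIdx.x*blockDim.x + threadIdx.x;
  int holo_y = blockIdx.y*blockDim.y + threadIdx.y;

  // Guard threads that fall outside the MAX_FINAL_X x MAX_FINAL_Y image
  if (holo_x >= MAX_FINAL_X || holo_y >= MAX_FINAL_Y) return;

  const float k = 2.0f*3.14f/0.000000054f;
  int idx = holo_x + holo_y*MAX_FINAL_X;

  // Accumulate in a register and write back once at the end,
  // instead of reading and writing global memory every iteration
  float acc = firstTime ? 0.0f : pIntensity[idx];
  for (int i = 0; i < pointsNumber; i++)
  {
    acc += A[i]*cosf(k*sqrtf(SQR(holo_x - X[i]) + SQR(holo_y - Y[i]) + SQR(Z[i])));
  }
  pIntensity[idx] = acc;
  // No shared memory is used, so no __syncthreads() is needed
}
```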

PS: You could also pass parameters by value to CUDA kernels, not only as pointers. Speed will be the same.

PPS: Please improve the code's readability... Right now you must edit six places to change the block configuration... Inside the kernel you could use the blockDim variable, and you could use constants in the go2 function. You could also use a bool for firstTime - it would be MUCH better than a float.
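For example, the launcher could keep the block size in one place (a sketch; `BLOCK` is a hypothetical constant, and the kernel is assumed to take `firstTime`/`pointsNumber` by value as suggested above):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 16  // single point of change for the block configuration

extern "C" void go2(float *pDATA, float *X, float *Y, float *Z,
                    float *pIntensity, bool firstTime, int pointsNumber)
{
  // Round the grid up so the whole image is covered even if the
  // dimensions are not multiples of BLOCK
  dim3 grid((MAX_FINAL_X + BLOCK - 1)/BLOCK, (MAX_FINAL_Y + BLOCK - 1)/BLOCK);
  dim3 block(BLOCK, BLOCK);

  CRT_GPU_2<<<grid, block>>>(pDATA, X, Y, Z, pIntensity, firstTime, pointsNumber);

  // Synchronize first, then check: kernel launches are asynchronous,
  // so errors only surface after the kernel has finished
  cudaError_t err = cudaThreadSynchronize();
  if (err != cudaSuccess)
    fprintf(stderr, "CRT_GPU_2 failed: %s\n", cudaGetErrorString(err));
}
```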

KoppeKTop
I had a mistake in the pointsNumber[0] value; moreover, your clues sped up this algorithm almost 2 times :D Thanks :) I have a second question... Is it possible that the change of GPU (from a GTX260 896MB to a GTX465 1GB) could cause some mistakes in memory management? On the GTX260 I could allocate a float array of 500*500*500 floats (a kind of LUT array), and now on the GTX465, when I allocate an array of more than 60*60*60 floats, I get an "unspecified launch error" where I copy data from host to device memory (this memcpy isn't connected with that LUT array)...?
Tome
It's very hard to say anything without code... I have never seen such an error during memcpy. This error is much like a "Segmentation fault" in host programs and occurs in kernels. `CUT_CHECK_ERROR` reads an error message which can be caused by earlier kernel launches (if you don't use `cudaThreadSynchronize`). There are only 4 types of async operations: 1. kernel launches 2. all cudaMemcpy*Async 3. device <-> device memcpy 4. memory initialization. Some advice: check the parameters in `cudaMemcpy` - right argument order and right memory allocation (maybe you forgot `cudaMalloc`?) - and use emu mode to debug.
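A defensive pattern for that copy might look like this (a sketch; `copyToDevice`, `h_data`, and `d_data` are hypothetical names, not from the question's code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: allocate on the device, then copy from the host,
// checking every call instead of letting a stale error surface later.
bool copyToDevice(const float *h_data, float **d_data, size_t n)
{
  cudaError_t err = cudaMalloc((void**)d_data, n*sizeof(float));
  if (err != cudaSuccess)
  {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return false;
  }
  // Note the argument order: destination first, then source
  err = cudaMemcpy(*d_data, h_data, n*sizeof(float), cudaMemcpyHostToDevice);
  if (err != cudaSuccess)
  {
    fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
    return false;
  }
  return true;
}
```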
KoppeKTop
And one more guess. Do you add a boundary check to the kernel? `if (holo_x+holo_y*MAX_FINAL_X >= MAX_FINAL_X*MAX_FINAL_Y) return;` Actually, 2 times is not the limit :) If you use __constant__ memory for storing A, X, Y and Z, and maybe move to fast math (carefully), you'll get some more speedup.
KoppeKTop
Oops! My fault. The check must be `if (holo_x >= MAX_FINAL_X || holo_y >= MAX_FINAL_Y) return;`
KoppeKTop
Incidentally, I'd advocate avoiding cutil (i.e. CUT_CHECK_ERROR). It's not supported by NVIDIA, and it is not a great way to handle errors since it exits immediately. The programmer should handle errors explicitly in a manner suitable for the application. The cutil macros are used in the SDK samples to hide this so the samples can concentrate on the actual algorithms etc., but they're not best practice for real programs.
Tom