ansaurus

Question

Answer 1

+1 A:

Is your GPU connected to a display? If so, I believe the default is that kernel execution will be aborted after 5 seconds. You can check whether kernel execution will time out by using cudaGetDeviceProperties (see reference page).
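
A minimal sketch of that check, assuming the kernel runs on device 0 and a toolkit whose cudaDeviceProp still has the kernelExecTimeoutEnabled field (newer toolkits also expose the same flag through cudaDeviceGetAttribute with cudaDevAttrKernelExecTimeout):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device = 0;  // assumption: the kernel in question runs on device 0
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }
    // Non-zero means the driver enforces a run time limit on kernels for this device.
    std::printf("%s: kernel execution timeout %s\n", prop.name,
                prop.kernelExecTimeoutEnabled ? "enabled" : "disabled");
    return 0;
}
```

If it reports "enabled", a kernel that runs too long on this device can be killed by the watchdog.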

Edric 2010-07-12 07:31:58

Yes, it is connected to my display...

Tome 2010-07-12 20:30:00

Answer 2

A:

It is better to call CUT_CHECK_ERROR after cudaThreadSynchronize(), because kernels run asynchronously and you must wait for the kernel to finish before you can learn about its errors. Otherwise, the error you receive in the second iteration may actually come from the first kernel launch.
Make sure that the most interesting variable, pointsNumber[0], holds a valid number (a bad value might cause a very long inner loop).
You could also improve the speed of your kernel function:
- Use a better block size. A 20x20 thread configuration will cause very slow memory access (see the Programming Guide and Best Practices). Try 16x16 blocks.
- Do not use the pow(..., 2.0) function. It's faster to use a SQR macro (#define SQR(x) ((x)*(x))).
- You don't use shared memory, so __syncthreads() is not required.
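
A small sketch of the squaring advice. One detail worth noting is that a squaring macro needs outer parentheses, otherwise operator precedence leaks into the calling expression. The names SQR_UNSAFE, sqr and the inv_sq_* helpers are made up for illustration; in a kernel the inline function would also need a __device__ qualifier.

```cpp
// Macro without outer parentheses: 1.0f / SQR_UNSAFE(d) expands to 1.0f / (d)*(d).
#define SQR_UNSAFE(x) (x)*(x)

// Fully parenthesized macro: the whole product is treated as one operand.
#define SQR(x) ((x)*(x))

// Inline function alternative: the argument is evaluated exactly once.
inline float sqr(float x) { return x * x; }

// The same expression, 1 / d^2, written three ways.
inline float inv_sq_unsafe(float d) { return 1.0f / SQR_UNSAFE(d); }  // really (1/d)*d
inline float inv_sq_macro(float d)  { return 1.0f / SQR(d); }
inline float inv_sq_func(float d)   { return 1.0f / sqr(d); }
```

With d = 2.0f, the first helper returns 1.0f while the other two return 0.25f.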

PS: You can also pass parameters to CUDA functions by value, not only through pointers. The speed will be the same.

PPS: Please improve the code's readability. Right now you must edit six places to change the block configuration. Inside the kernel you could use the blockDim variable, and you could use constants in the go2 function. You could also make firstTime a bool; that would be MUCH better than a float.

KoppeKTop 2010-07-12 12:43:21

I had a mistake in the pointNumber[0] value; moreover, your hints sped up this algorithm almost 2 times :D Thanks :) I have a second question: is it possible that changing the GPU (from a GTX 260 896MB to a GTX 465 1GB) could cause problems in memory management? On the GTX 260 I could allocate a float array of 500*500*500 floats (this is a kind of LUT array), and now on the GTX 465, when I allocate an array of more than 60*60*60 floats, I get an "unspecified launch error" at the place where I copy data from host to device memory (this memcpy isn't connected with that LUT array)...?

Tome 2010-07-12 20:29:45

It's very hard to say anything without code... I have never seen such an error during a memcpy. This error is much the same as a "Segmentation fault" in host programs, and it occurs in kernels. `CUT_CHECK_ERROR` reads an error message which can be caused by earlier kernel launches (if you don't use `cudaThreadSynchronize`). There are only 4 types of async operations:
1. kernel launches
2. all cudaMemcpy*Async calls
3. memcpy device <-> device
4. memory initialization
Some advice:
- Check the parameters of `cudaMemcpy`: the right order and the right memory allocation (maybe you forgot a `cudaMalloc`?)
- Use emulation mode to debug.
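
For the `cudaMemcpy` check, a minimal sketch of a host-to-device copy with the arguments in the expected order. The buffer names and the 60*60*60 element count are only placeholders taken from the numbers mentioned in the question, not the asker's actual code:

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main() {
    const size_t n = 60 * 60 * 60;           // placeholder element count
    std::vector<float> host(n, 1.0f);
    const size_t bytes = n * sizeof(float);  // sizes are given in bytes, not elements

    float* dev = nullptr;
    cudaError_t err = cudaMalloc(&dev, bytes);  // the device buffer must exist before the copy
    if (err == cudaSuccess) {
        // Argument order: destination, source, byte count, direction.
        err = cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    }
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    }
    cudaFree(dev);  // freeing a null pointer is a no-op
    return err == cudaSuccess ? 0 : 1;
}
```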

KoppeKTop 2010-07-12 21:09:51

And one more guess: did you add a boundary check to the kernel? `if (holo_x+holo_y*MAX_FINAL_X >= MAX_FINAL_X*MAX_FINAL_Y) return;` Actually, 2 times is not the limit :) If you use __constant__ memory for storing A, X, Y and Z, and maybe move to fast math (carefully), you'll get some more speedup.

KoppeKTop 2010-07-12 21:20:12

Oops! My fault. The check must be `if (holo_x >= MAX_FINAL_X || holo_y >= MAX_FINAL_Y) return;`
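
A sketch of why the per-axis check is the right one, together with a 16x16 launch whose block size is defined in one place. The image dimensions (1000 x 700), the kernel name fill and the helper names are assumptions for illustration; only holo_x, holo_y, MAX_FINAL_X and MAX_FINAL_Y come from the thread. The host helpers compile as plain C++; the kernel part is only compiled by nvcc.

```cpp
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Assumed sizes: the thread names these constants but never gives their values.
constexpr int MAX_FINAL_X = 1000;
constexpr int MAX_FINAL_Y = 700;
constexpr int BLOCK = 16;  // single place to change the block configuration

// Number of blocks needed to cover n elements (rounds up).
HD inline int blocks_for(int n, int block) { return (n + block - 1) / block; }

// Corrected check: each coordinate is tested against its own limit.
HD inline bool in_bounds(int x, int y) { return x < MAX_FINAL_X && y < MAX_FINAL_Y; }

// First (flawed) check: only the linear index is tested.
HD inline bool passes_linear_check(int x, int y) {
    return x + y * MAX_FINAL_X < MAX_FINAL_X * MAX_FINAL_Y;
}

#ifdef __CUDACC__
__global__ void fill(float* out) {
    int holo_x = blockIdx.x * blockDim.x + threadIdx.x;
    int holo_y = blockIdx.y * blockDim.y + threadIdx.y;
    if (!in_bounds(holo_x, holo_y)) return;  // threads in the padding do nothing
    out[holo_x + holo_y * MAX_FINAL_X] = 0.0f;
}

inline void launch_fill(float* d_out) {
    dim3 block(BLOCK, BLOCK);
    dim3 grid(blocks_for(MAX_FINAL_X, BLOCK), blocks_for(MAX_FINAL_Y, BLOCK));
    fill<<<grid, block>>>(d_out);
}
#endif
```

With these sizes the grid is 63 x 44 blocks, so some threads fall outside the image. A thread at (1000, 0) passes the linear check and would write into the start of the next row, while the per-axis check rejects it.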

KoppeKTop 2010-07-12 21:54:27

Incidentally, I'd advocate avoiding cutil (i.e. CUT_CHECK_ERROR). It's not supported by NVIDIA, and it is not a great way to handle errors since it exits immediately. The programmer should handle errors explicitly, in a manner suitable for the application. The cutil macros are used in the SDK samples to hide this and concentrate on the actual algorithms etc., but they're not best practice for real programs.
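
One possible shape for explicit handling, as a sketch only: the kernel scale and the wrapper run_scale are hypothetical, and cudaDeviceSynchronize is used in place of the older cudaThreadSynchronize mentioned earlier in the thread. The wrapper returns the error to the caller instead of exiting, so the application decides what to do with it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only to have something to launch.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Launches the kernel and returns the first error seen, without exiting.
cudaError_t run_scale(float* d_data, int n, float factor) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_data, n, factor);

    cudaError_t err = cudaGetLastError();  // errors in the launch itself
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();        // errors raised while the kernel ran
}

int main() {
    const int n = 1 << 20;
    float* d_data = nullptr;
    cudaError_t err = cudaMalloc(&d_data, n * sizeof(float));
    if (err == cudaSuccess) err = cudaMemset(d_data, 0, n * sizeof(float));
    if (err == cudaSuccess) err = run_scale(d_data, n, 2.0f);
    if (err != cudaSuccess) {
        // The caller chooses the reaction; here it just reports and cleans up.
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    }
    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}
```

Synchronizing right after the launch also ties any failure to this kernel rather than to a later, unrelated call.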

Tom 2010-07-20 09:31:12

Timeout in CUDA? / fermi / gtx465
