views: 1389
answers: 2

I am currently writing a matrix multiplication on a GPU and would like to debug my code, but since I cannot use printf inside a device function, is there something else I can do to see what is going on inside that function? This is my current function:

__global__ void MatrixMulKernel(Matrix Ad, Matrix Bd, Matrix Xd){

    int tx = threadIdx.x;
    int ty = threadIdx.y;

    int bx = blockIdx.x;
    int by = blockIdx.y;

    float sum = 0;

    for( int k = 0; k < Ad.width ; ++k){
        float Melement = Ad.elements[ty * Ad.width + k];
        float Nelement = Bd.elements[k * Bd.width + tx];
        sum += Melement * Nelement;
    }

    Xd.elements[ty * Xd.width + tx] = sum;
}

I would love to know whether Ad and Bd are what I think they are, and to see whether that function is actually being called.

+3  A: 

You have a few choices:

  • Use a GPU debugger, e.g. cuda-gdb on Linux or Nexus on Windows
  • Use cuPrintf, which is available to registered developers (sign up here)
  • Manually copy the data that you want to see into a device buffer, then dump that buffer on the host after your kernel has completed (remember to synchronise) - a minimal sketch of this pattern follows this list
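
For that last option, roughly (the extra dbg argument, the helper name and the error check are illustrative, not part of the question's code; it assumes the Matrix struct from above with elements already pointing at device memory):

#include <stdio.h>
#include <stdlib.h>

// Same kernel with an extra debug buffer: each thread stashes the value it
// wants to inspect so the host can print it afterwards.
__global__ void MatrixMulKernelDbg(Matrix Ad, Matrix Bd, Matrix Xd, float *dbg){

    int tx = threadIdx.x;
    int ty = threadIdx.y;

    float sum = 0;
    for( int k = 0; k < Ad.width ; ++k)
        sum += Ad.elements[ty * Ad.width + k] * Bd.elements[k * Bd.width + tx];

    Xd.elements[ty * Xd.width + tx] = sum;
    dbg[ty * Xd.width + tx] = sum;   // copy out whatever intermediate you like
}

// Host side: allocate the buffer, launch, synchronise, copy back, print.
void launchAndDump(Matrix Ad, Matrix Bd, Matrix Xd, dim3 grid, dim3 block, int n){

    float *d_dbg = 0;
    cudaMalloc((void**)&d_dbg, n * sizeof(float));

    MatrixMulKernelDbg<<<grid, block>>>(Ad, Bd, Xd, d_dbg);
    cudaThreadSynchronize();   // wait for the kernel to finish
    printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));

    float *h_dbg = (float*)malloc(n * sizeof(float));
    cudaMemcpy(h_dbg, d_dbg, n * sizeof(float), cudaMemcpyDeviceToHost);
    for( int i = 0; i < n; ++i)
        printf("dbg[%d] = %f\n", i, h_dbg[i]);

    free(h_dbg);
    cudaFree(d_dbg);
}

The cudaGetLastError check after the launch also answers your "is that function actually being called" question, since a bad launch configuration shows up there rather than failing silently.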

Regarding your code snippet:

  • Consider passing the Matrix structs in via pointer (i.e. cudaMemcpy them to the device, then pass in the device pointer); right now you have no problem, but if the argument list gets very large you may hit the 256-byte limit on kernel parameters
  • Your reads from Ad are inefficient: each read into Melement causes a separate 32-byte memory transaction. Consider using shared memory as a staging area (cf. the transposeNew sample in the SDK) - see the tiled sketch below
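
To make the shared-memory point concrete, here is a minimal sketch of the standard tiled scheme (TILE and MatrixMulTiled are illustrative names; it assumes square TILE x TILE thread blocks and matrix widths that are multiples of TILE, and reuses your Matrix struct):

#define TILE 16

// Each block stages a TILE x TILE sub-tile of Ad and Bd in shared memory,
// so every element fetched from global memory is reused TILE times and
// consecutive threads read consecutive addresses (coalesced).
__global__ void MatrixMulTiled(Matrix Ad, Matrix Bd, Matrix Xd){

    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;

    float sum = 0;
    for( int m = 0; m < Ad.width / TILE; ++m){
        // each thread loads one element of the current tile of Ad and Bd
        As[ty][tx] = Ad.elements[row * Ad.width + m * TILE + tx];
        Bs[ty][tx] = Bd.elements[(m * TILE + ty) * Bd.width + col];
        __syncthreads();

        for( int k = 0; k < TILE; ++k)
            sum += As[ty][k] * Bs[k][tx];
        __syncthreads();
    }

    Xd.elements[row * Xd.width + col] = sum;
}

Note this version actually uses blockIdx; your current kernel computes bx and by but never reads them, so it only writes correct results for a single block.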
Tom
+1  A: 

by the way..

crick3r