Say I want to time a memory fetch from device global memory:

cudaMemcpy(...cudaMemcpyHostToDevice);
cudaThreadSynchronize();
time1 ...

kernel_call();
cudaThreadSynchronize();
time2 ...

cudaMemcpy(...cudaMemcpyDeviceToHost);
cudaThreadSynchronize();
time3 ...

I don't understand why time2 and time3 always come out the same. My kernel does take a long time to produce the result, but shouldn't cudaThreadSynchronize() block until every operation issued before it, including the kernel, is done? Copying from device memory back to host memory should also take a noticeable amount of time. Thanks.
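For what it's worth, CUDA events are the usual way to time device-side work without mixing in host-timer overhead. A minimal sketch of the pattern above (the kernel body and buffer sizes are placeholders, not the original code):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the original kernel_call().
__global__ void kernel_call(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *h = (float *)malloc(n * sizeof(float));
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventCreate(&t2); cudaEventCreate(&t3);

    cudaEventRecord(t0);
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    kernel_call<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(t2);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaEventRecord(t3);
    cudaEventSynchronize(t3);  // wait for everything queued above to finish

    float ms;
    cudaEventElapsedTime(&ms, t0, t1); printf("H2D copy: %.3f ms\n", ms);
    cudaEventElapsedTime(&ms, t1, t2); printf("kernel:   %.3f ms\n", ms);
    cudaEventElapsedTime(&ms, t2, t3); printf("D2H copy: %.3f ms\n", ms);

    cudaFree(d);
    free(h);
    return 0;
}
```

Events are recorded into the same (default) stream as the copies and the kernel, so each elapsed time measures exactly the work queued between the two events, independent of when the host thread happens to read its clock.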

+1  A: 

The best way to monitor execution time is to set the CUDA_PROFILE=1 environment variable, and list the options timestamp, gpustarttimestamp, and gpuendtimestamp in the file named by CUDA_PROFILE_CONFIG. After running your CUDA program with those environment variables, a log file (cuda_profile.log by default) is created listing the timings of the memcpys and kernel executions at microsecond resolution. Clean and non-invasive.
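For reference, the setup might look like the following. The variable names are from the legacy CUDA command-line profiler (roughly the CUDA 3.x era; later toolkits renamed them to COMPUTE_PROFILE*), so check the profiler documentation for your toolkit version:

```shell
# Enable the legacy CUDA command-line profiler.
export CUDA_PROFILE=1                          # turn profiling on
export CUDA_PROFILE_CONFIG=profile_config.txt  # which fields to log

# The config file simply lists the requested options, one per line.
cat > profile_config.txt <<'EOF'
timestamp
gpustarttimestamp
gpuendtimestamp
EOF

# Then run the application as usual; the profiler writes its log
# (cuda_profile.log by default) into the working directory:
# ./your_cuda_app
```

Because the timestamps are taken by the driver, this measures the actual GPU start and end of each memcpy and kernel, with no changes to the program's source.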

fabrizioM