say I want to time a memory fetching from device global memory
cudaMemcpy(...cudaMemcpyHostToDevice);
cudaThreadSynchronize();
time1 ...
kernel_call();
cudaThreadSynchronize();
time2 ...
cudaMemcpy(...cudaMemcpyDeviceToHost);
cudaThreadSynchronize();
time3 ...
I don't understand why my time3 and time2 always give same results. My kernel does take a long time to get the result ready for fetching, but shouldn't cudaThreadSynchronize() block all the operation before kernel_call is done? Also fetching from device memory to host memory shall also take a while, at least noticeable. Thanks.