Hi all!
I am having a strange problem. I have written a CUDA program that runs correctly in emulation mode and produces all the expected results. However, when I run it on hardware (a "G210"), the result buffer always contains zeros.
I pass two vectors to the kernel: one filled with random values, the other initialized to zero. The kernel copies the first vector into shared memory, does some swapping and other operations, and then writes the results to the second vector (the one initialized to zero).
I am using double precision and compile with the -arch sm_13 flag. All memory allocations also use sizeof(double).
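Since double precision needs a device of compute capability 1.3 or higher, it may be worth confirming what the card actually reports. The following is only a minimal sketch of such a check, assuming the card in question is device 0; it is not part of the original program.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Assumption: the GPU under test is device 0.
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // Report the device name and its compute capability.
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);

    // Double precision is available from compute capability 1.3 onward.
    bool hasDouble = (prop.major > 1) || (prop.major == 1 && prop.minor >= 3);
    printf("double precision supported: %s\n", hasDouble ? "yes" : "no");
    return 0;
}
```

If the reported capability is below 1.3, code built for sm_13 would not be expected to run on that device.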
I have checked that the kernel is invoked, and it is, so no problem there. The cudaMemcpy calls also report no problems.
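For reference, this is roughly the kind of checking that would show whether the launch itself succeeded on the hardware. It is a sketch under assumptions, not the actual code: `copyKernel` and the `CUDA_CHECK` macro are hypothetical names, and the kernel is a trivial stand-in for the real shared-memory one.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t e = (call);                                       \
        if (e != cudaSuccess) {                                       \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e));                           \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical stand-in kernel: copies in[] to out[] in double precision.
__global__ void copyKernel(const double *in, double *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const int n = 256;
    const size_t bytes = n * sizeof(double);
    double h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) { h_in[i] = 0.5 * i; h_out[i] = 0.0; }

    double *d_in = NULL, *d_out = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d_in, bytes));
    CUDA_CHECK(cudaMalloc((void **)&d_out, bytes));
    CUDA_CHECK(cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemset(d_out, 0, bytes));

    copyKernel<<<(n + 127) / 128, 128>>>(d_in, d_out, n);
    CUDA_CHECK(cudaGetLastError());       // reports launch failures
    CUDA_CHECK(cudaDeviceSynchronize());  // reports failures during execution

    CUDA_CHECK(cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost));
    printf("h_out[10] = %f\n", h_out[10]);

    CUDA_CHECK(cudaFree(d_in));
    CUDA_CHECK(cudaFree(d_out));
    return 0;
}
```

Checking only the cudaMemcpy return values does not by itself show that the kernel ran, because a failed launch is reported separately through cudaGetLastError.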
What could be the problem? :( Why would it work in emulation but not on hardware?
I am quite confused. Any ideas?