I want to use constant memory which will be accessed by all threads across all of my kernels.
The declaration is something like this:
extern __constant__ float smooth[8 * 1024];
I copy data into this variable with:
cudaMemcpyToSymbol("smooth", smooth_local, smooth_size, 0, cudaMemcpyHostToDevice);
where smooth_size is 7K bytes.
The output was incorrect. When I ran it in -deviceemu mode and printed the contents of both variables inside the kernel, smooth was all zeroes while smooth_local was correct.
I also tried printing the contents right after cudaMemcpyToSymbol, but it still gave me zeroes.
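For reference, here is a minimal sketch of the round-trip check I would expect to pass. It is not my real code: the helper name roundtrip_mismatches and the test data are made up for illustration, and it assumes the constant array is defined (not only declared extern) in the same .cu file and passed to the copy calls by identifier, which is the form newer toolkits require instead of the name string. It checks the return code of cudaMemcpyToSymbol and reads the data back with cudaMemcpyFromSymbol rather than printing from a kernel.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Assumption of this sketch: the symbol is defined in this translation unit.
__constant__ float smooth[8 * 1024];

// Hypothetical helper: copies `count` floats into `smooth`, reads them back,
// and returns the number of mismatching elements, or -1 on a CUDA error.
int roundtrip_mismatches(const float* src, size_t count) {
    const size_t bytes = count * sizeof(float);
    if (bytes > sizeof(smooth)) return -1;  // would overflow the constant array

    // Pass the symbol itself, not the string "smooth".
    cudaError_t err = cudaMemcpyToSymbol(smooth, src, bytes, 0,
                                         cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpyToSymbol: %s\n", cudaGetErrorString(err));
        return -1;
    }

    std::vector<float> readback(count);
    err = cudaMemcpyFromSymbol(readback.data(), smooth, bytes, 0,
                               cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpyFromSymbol: %s\n", cudaGetErrorString(err));
        return -1;
    }

    int mismatches = 0;
    for (size_t i = 0; i < count; ++i) {
        if (readback[i] != src[i]) ++mismatches;
    }
    return mismatches;
}

int main() {
    // 7K bytes of made-up test data, matching the size in my case.
    const size_t count = 7 * 1024 / sizeof(float);
    std::vector<float> smooth_local(count);
    for (size_t i = 0; i < count; ++i) smooth_local[i] = static_cast<float>(i);

    int bad = roundtrip_mismatches(smooth_local.data(), count);
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

If the copy call returns an error, or the readback differs from smooth_local, that would at least show whether the copy itself is failing or whether the kernel is looking at a different symbol.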
Can anyone shed light on my problem?