Hi, I'm having some trouble with a very basic CUDA program. I have a program that multiplies two vectors on the Host and on the Device and then compares the results. This works without a problem. The trouble starts when I try out different numbers of threads and blocks for learning purposes. I have the following kernel:
__global__ void multiplyVectorsCUDA(float *a, float *b, float *c, int N) {
    int idx = threadIdx.x;
    if (idx < N)
        c[idx] = a[idx] * b[idx];
}
which I call like:
multiplyVectorsCUDA<<<nBlocks, nThreads>>>(vector_a_d, vector_b_d, vector_c_d, N);
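For context, the host-side reference multiply and the comparison look roughly like this (a sketch with made-up names; my actual code differs in details):

```c
#include <math.h>

/* Host-side reference multiply, computed on the CPU for comparison
   with the result copied back from the device. */
static void multiplyVectorsHost(const float *a, const float *b, float *c, int N) {
    for (int i = 0; i < N; i++)
        c[i] = a[i] * b[i];
}

/* Returns 1 if every element of x matches y within eps, 0 otherwise. */
static int compareVectors(const float *x, const float *y, int N, float eps) {
    for (int i = 0; i < N; i++)
        if (fabsf(x[i] - y[i]) > eps)
            return 0;
    return 1;
}
```

The test "works" for me when `compareVectors` returns 1 for the host result against the array copied back from `vector_c_d`.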
For the moment I've fixed nBlocks to 1, so I only vary the vector size N and the number of threads nThreads. From what I understand, there will be one thread per multiplication, so N and nThreads should be equal.
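(When I eventually do vary nBlocks, I believe the usual pattern is to round up so that nBlocks * nThreads >= N; this is just host arithmetic, sketched below. The kernel would then compute its index as `blockIdx.x * blockDim.x + threadIdx.x` instead of plain `threadIdx.x`.)

```c
/* Ceiling division: the number of blocks needed so that
   nBlocks * nThreads >= N, i.e. every element gets a thread. */
static int blocksFor(int N, int nThreads) {
    return (N + nThreads - 1) / nThreads;
}
```

For example, with N=16 this gives 1 block of 16 threads, or 2 blocks of 8 threads.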
The problem is the following:

- I first call the kernel with N=16 and nThreads<16, which doesn't work. (This is expected.)
- Then I call it with N=16 and nThreads=16, which works fine. (Again, as expected.)
- But when I then call it with N=16 and nThreads<16 again, it still works!
I don't understand why the last step doesn't fail like the first one. It only fails again if I restart my PC.
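One thing I suspect is that vector_c_d still holds the results of the previous successful run, so the comparison passes even though the second kernel only wrote part of the vector. To rule that out I could clear the output buffer before each launch, something like this (a sketch, untested):

```
// Clear the device output buffer before each launch so a kernel that
// covers only part of the vector can't be masked by stale results.
cudaMemset(vector_c_d, 0, N * sizeof(float));
multiplyVectorsCUDA<<<nBlocks, nThreads>>>(vector_a_d, vector_b_d, vector_c_d, N);
cudaDeviceSynchronize();
```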
Has anyone run into something like this before, or can someone explain this behavior?