Hi, I'm having some trouble with a very basic CUDA program. The program multiplies two vectors on the host and on the device and then compares the results, and that part works without a problem. The trouble starts when I try to test different numbers of threads and blocks for learning purposes. I have the following kernel:
__global__ void multiplyVectorsCUDA(float *a, float *b, float *c, int N){
    // With a single block, threadIdx.x is the global element index
    int idx = threadIdx.x;
    // Guard against threads past the end of the vectors
    if (idx < N)
        c[idx] = a[idx]*b[idx];
}
which I call like this:
multiplyVectorsCUDA <<<nBlocks, nThreads>>> (vector_a_d, vector_b_d, vector_c_d, N);
For the moment I've fixed nBlocks to 1, so I only vary the vector size N and the number of threads nThreads. From what I understand, there should be one thread per multiplication, so N and nThreads should be equal.
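For context, here is roughly what the host code around this call looks like. This is a simplified sketch, not my exact program: the host-side array names, the input values, and the comparison step are placeholders, and most error checking is omitted; the kernel is the one shown above.

#include <cuda_runtime.h>
#include <stdio.h>

int main(){
    int N = 16;
    int nBlocks = 1;
    int nThreads = 16;           // this is what I vary
    size_t size = N * sizeof(float);

    // Host vectors (placeholder names and values for this sketch)
    float vector_a_h[16], vector_b_h[16], vector_c_h[16];
    for (int i = 0; i < N; i++){ vector_a_h[i] = (float)i; vector_b_h[i] = 2.0f * i; }

    // Device vectors
    float *vector_a_d, *vector_b_d, *vector_c_d;
    cudaMalloc((void**)&vector_a_d, size);
    cudaMalloc((void**)&vector_b_d, size);
    cudaMalloc((void**)&vector_c_d, size);
    cudaMemcpy(vector_a_d, vector_a_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(vector_b_d, vector_b_h, size, cudaMemcpyHostToDevice);

    // The launch in question
    multiplyVectorsCUDA <<<nBlocks, nThreads>>> (vector_a_d, vector_b_d, vector_c_d, N);
    cudaDeviceSynchronize();

    // Copy the result back and compare it element by element
    // against the product computed on the host
    cudaMemcpy(vector_c_h, vector_c_d, size, cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++){
        if (vector_c_h[i] != vector_a_h[i] * vector_b_h[i])
            printf("mismatch at %d: %f vs %f\n", i, vector_c_h[i], vector_a_h[i] * vector_b_h[i]);
    }

    cudaFree(vector_a_d);
    cudaFree(vector_b_d);
    cudaFree(vector_c_d);
    return 0;
}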
The problem is the following (the three launch configurations are sketched below):
- I first call the kernel with N=16 and nThreads<16, which doesn't work. (This is ok.)
- Then I call it with N=16 and nThreads=16, which works fine. (Again, works as expected.)
- But when I then call it with N=16 and nThreads<16, it still works!
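Concretely, with nThreads = 8 as an example of "fewer than 16", the three launches from the list look like this (each one happens in its own run of the program sketched above, with everything else unchanged):

// Run 1: fewer threads than elements, so elements 8..15 are never computed
multiplyVectorsCUDA <<<1, 8>>> (vector_a_d, vector_b_d, vector_c_d, 16);   // comparison fails, as expected

// Run 2: one thread per element, all 16 products are computed
multiplyVectorsCUDA <<<1, 16>>> (vector_a_d, vector_b_d, vector_c_d, 16);  // comparison passes

// Run 3: fewer threads again; I expected this to fail like run 1
multiplyVectorsCUDA <<<1, 8>>> (vector_a_d, vector_b_d, vector_c_d, 16);   // ...but the comparison still passes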
I don't understand why the last step doesn't fail like the first one. It only fails again if I restart my PC.
Has anyone run into something like this before or can explain this behavior?