I am trying to compare cross-correlation using an FFT versus using the windowing method.

My Matlab code is:

isize = 20;
n = 7;
for i = 1:n   % 7x7 xcorr
  for j = 1:n
    % ref is a 26x26 (676-element) array; ffcorr1 is a 20x20 (400-element) array
    xcout(i,j) = sum(sum(ffcorr1 .* ref(i:i+isize-1,j:j+isize-1)));
  end
end

The corresponding CUDA kernel:

__global__ void xc_corr(double* in_im, double* ref_im, int pix3, int isize, int n, double* out1, double* temp1, double* sum_temp1)
{

    int p = blockIdx.x * blockDim.x + threadIdx.x;
    int q = 0;
    int i = 0;
    int j = 0;
    int summ = 0;

    for(i = 0; i < n; ++i)
    {
        for(j = 0; j < n; ++j)
        {
            summ  = 0; //force update
            for(p = 0; p < pix1; ++p)
            {
                for(q = 0; q < pix1; ++q)
                {
                    temp1[((i*n+j)*pix1*pix1)+p*pix1+q] = in_im[p*pix1+q] * ref_im[(p+i)*pix1+(q+j)];               
                    sum_temp1[((i*n+j)*pix1*pix1)+p*pix1+q] += temp1[((i*n+j)*pix1*pix1)+p*pix1+q];
                    out1[i*n+j] = sum_temp1[((i*n+j)*pix1*pix1)+p*pix1+q];
                }
            }       
        }
    }
}

I have called this kernel as

int blocksize = 64; //multiple of 32
int nblocks = (pix3+blocksize-1)/blocksize; //round to max pix3 = 400
xc_corr <<< nblocks,blocksize >>> (ffcorr1, ref_d, pix3, isize, npix, xcout, xc_partial);
cudaThreadSynchronize();

Somehow, when I do a diff on the output file, I see that the CUDA kernel computes for only the first 400 elements.

What is the correct way to write this kernel?

Also, what is the difference when I declare i and j as shown below in my kernel?

int i = blockIdx.x * blockDim.y + threadIdx.x * threadIdx.y;
int j = blockIdx.y * blockDim.x + threadIdx.x * threadIdx.y;
+3  A: 
int blocksize = 64; //multiple of 32
int nblocks = (pix3+blocksize-1)/blocksize; //round to max pix3 = 400
xc_corr <<< nblocks,blocksize >>> (ffcorr1, ref_d, pix3, isize, npix, xcout, xc_partial);

means that you are launching 64 threads per block, with the number of thread blocks rounded up so that at least pix3 elements are covered. If pix3 is indeed 400, then you'll launch 7 thread blocks of 64 threads each (448 threads total), and the last 48 threads do nothing.

I'm not too sure what the problem is here.

Also,

int i = blockIdx.x * blockDim.y + threadIdx.x * threadIdx.y;
int j = blockIdx.y * blockDim.x + threadIdx.x * threadIdx.y;

blocksize and nblocks are actually converted to dim3 vectors, so each has an (x,y,z) value. The launch syntax is <<<nblocks, blocksize>>>, so if you call a kernel with <<<64,7>>>, that'll be translated to

dim3 nblocks(64,1,1);
dim3 blocksize(7,1,1);
kernel<<<nblocks,blocksize>>>();

So inside the kernel, blockIdx and threadIdx each have three components (x, y, and z), corresponding to your position in the 3-D grid of blocks and the 3-D block of threads. In your case, since you only specify an x dimension, blockIdx.y and threadIdx.y are always going to be 0 no matter what. So essentially, they're useless.

Honestly, it seems like you should reread the basics of CUDA in the programming guide, because there are a lot of basics you seem to be missing. Explaining it all here wouldn't be economical since it's written down in the nice documentation you can get here. And if you just want a faster FFT with CUDA, there are a number of libraries (CUFFT, for example) on NVIDIA's CUDA Zone that you can download and install, and they will do it for you if you don't care about learning CUDA.

Best of luck mate.

PS. you don't need to call cudaThreadSynchronize after each kernel ;) Launches on the same stream are implicitly ordered, and a cudaMemcpy back to the host will block until the preceding kernel has finished anyway.

Xzhsh
@Xzhsh: thanks for the tips. I understand that I am instantiating only 64 threads per block. I could have rewritten this as dim3 threadsperblock(20,20); dim3 numblocks(pix3/threadsperblock.x, pix3/threadsperblock.y); I wanted to check with the CUDA Occupancy Calculator whether this was a better method, but I was not sure how many registers are used and so was unsure about the occupancy. My question is regarding the kernel; any suggestions/corrections?
vivekv80
