I am trying to compare cross-correlation using an FFT versus using the windowing method.

My Matlab code is:

isize = 20;
n = 7;
for i = 1:n   % 7x7 xcorr
  for j = 1:n
    % ref is a 26x26 (676-element) array; ffcorr1 is a 20x20 (400-element) array
    xcout(i,j) = sum(sum(ffcorr1 .* ref(i:i+isize-1,j:j+isize-1)));
  end
end

The corresponding CUDA kernel:

__global__ void xc_corr(double* in_im, double* ref_im, int pix3, int isize, int n, double* out1, double* temp1, double* sum_temp1)
{

    int p = blockIdx.x * blockDim.x + threadIdx.x;
    int q = 0;
    int i = 0;
    int j = 0;
    int summ = 0;

    for(i = 0; i < n; ++i)
    {
        for(j = 0; j < n; ++j)
        {
            summ  = 0; //force update
            for(p = 0; p < pix1; ++p)
            {
                for(q = 0; q < pix1; ++q)
                {
                    temp1[((i*n+j)*pix1*pix1)+p*pix1+q] = in_im[p*pix1+q] * ref_im[(p+i)*pix1+(q+j)];               
                    sum_temp1[((i*n+j)*pix1*pix1)+p*pix1+q] += temp1[((i*n+j)*pix1*pix1)+p*pix1+q];
                    out1[i*n+j] = sum_temp1[((i*n+j)*pix1*pix1)+p*pix1+q];
                }
            }       
        }
    }
}

I have called this kernel as

int blocksize = 64; //multiple of 32
int nblocks = (pix3+blocksize-1)/blocksize; //round to max pix3 = 400
xc_corr <<< nblocks,blocksize >>> (ffcorr1, ref_d, pix3, isize, npix, xcout, xc_partial);
cudaThreadSynchronize();

Somehow, when I do a diff on the output file, I see that the CUDA kernel computes for only the first 400 elements.

What is the correct way to write this kernel?

Also, what is the difference when I declare i and j as shown below in my kernel?

int i = blockIdx.x * blockDim.y + threadIdx.x * threadIdx.y;
int j = blockIdx.y * blockDim.x + threadIdx.x * threadIdx.y;
+3  A: 
int blocksize = 64; //multiple of 32
int nblocks = (pix3+blocksize-1)/blocksize; //round to max pix3 = 400
xc_corr <<< nblocks,blocksize >>> (ffcorr1, ref_d, pix3, isize, npix, xcout, xc_partial);

means that you are launching 64 threads per block, with the number of thread blocks rounded up so that at least pix3 elements are covered. If pix3 is indeed 400, then you'll launch 7 thread blocks of 64 threads each (448 threads total), and the last 48 threads do nothing.

I'm not too sure what the problem is here.

Also,

int i = blockIdx.x * blockDim.y + threadIdx.x * threadIdx.y;
int j = blockIdx.y * blockDim.x + threadIdx.x * threadIdx.y;

blocksize and nblocks are actually converted to dim3 vectors, so each has an (x,y,z) value. The launch syntax is <<<nblocks, blocksize>>>, so if you call a kernel with <<<64,7>>>, that'll be translated to

dim3 nblocks(64,1,1);
dim3 blocksize(7,1,1);
kernel<<<nblocks,blocksize>>>();

So inside the kernel, blockIdx and threadIdx each have three components (x, y, and z), corresponding to your position in the 3-D grid of blocks and the 3-D block of threads. In your case, since you only specify an x dimension, blockIdx.y and threadIdx.y are always going to be 0 no matter what. So essentially, they're useless.

Honestly, it seems like you should reread the basics of CUDA in the programming guide, because there are a lot of basics you seem to be missing. Explaining it all here wouldn't be economical since it's written down in the nice documentation you can get here. And if you just want a faster FFT with CUDA, there are a number of libraries (CUFFT, for example) on NVIDIA's CUDA Zone that you can download and install, and they will do it for you if you don't care about learning CUDA.

Best of luck mate.

PS. you don't need to call cudaThreadSynchronize after each kernel ;) Launches on the same stream are implicitly ordered, and a cudaMemcpy back to the host will block until the preceding kernel has finished anyway.

Xzhsh
@Xzhsh: thanks for the tips. I understand that I am instantiating only 64 threads per block. I could have rewritten this as dim3 threadsperblock(20,20); dim3 numblocks(pix3/threadsperblock.x, pix3/threadsperblock.y); I wanted to check with the CUDA Occupancy Calculator whether this was a better method, but I was not sure how many registers are used and so was unsure about the occupancy. My question is regarding the kernel; any suggestions/corrections?
vivekv80
