My kernel only works in block (0,0)

tags:

cuda

Q:

I am trying to write a simple matrix multiplication application that multiplies two square matrices using CUDA. My kernel only computes correct results in block (0,0) of the grid.

This is my invocation code:

dim3 dimBlock(4, 4, 1);
dim3 dimGrid(4, 4, 1);
// Launch the kernel
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

This is my kernel function:

__global__ void MatrixMulKernel(int* Md, int* Nd, int* Pd, int Width)
{
    const int tx = threadIdx.x;
    const int ty = threadIdx.y;
    const int bx = blockIdx.x;
    const int by = blockIdx.y;
    const int row = (by * blockDim.y + ty);
    const int col = (bx * blockDim.x + tx);

    // Pvalue stores the Pd element that is computed by the thread
    int Pvalue = 0;

    for (int k = 0; k < Width; k++)
    {
        Pvalue += Md[row * Width + k] * Nd[k * Width + col];
    }
    __syncthreads();
    // Write the matrix to device memory; each thread writes one element
    Pd[row * Width + col] = Pvalue;
}

I think the problem may have something to do with memory but I'm a bit lost. What should I do to make this code work across several blocks?

+1 A:

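The problem was with my CUDA kernel invocation: the grid was far too small for the matrices being processed. A 4 x 4 grid of 4 x 4 blocks covers only a 16 x 16 region of Pd, so the grid dimensions have to be derived from Width rather than hard-coded.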

ZeroDivide 2010-06-09 22:45:44
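
For readers who hit the same symptom, here is a minimal sketch of one common way to size the grid from Width. It is not the code the poster ended up with, which is not shown. The names ceilDiv, gridFor, MatrixMulKernelGuarded and launchMatrixMul are hypothetical, and the sketch assumes Md, Nd and Pd are already device pointers to Width x Width int matrices. The bounds guard is an addition beyond what the answer describes; it matters whenever Width is not a multiple of the block size, because the rounded-up grid then contains threads that fall outside the matrix.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: integer ceiling division, so a partial tile still gets a block.
static inline unsigned int ceilDiv(unsigned int n, unsigned int d)
{
    return (n + d - 1) / d;
}

// Hypothetical helper: smallest grid whose threads cover a Width x Width matrix.
static inline dim3 gridFor(int Width, dim3 block)
{
    return dim3(ceilDiv(Width, block.x), ceilDiv(Width, block.y), 1);
}

// Same arithmetic as the kernel in the question, plus a bounds guard (an assumption,
// not part of the original post) for threads that land past the matrix edge.
__global__ void MatrixMulKernelGuarded(const int* Md, const int* Nd, int* Pd, int Width)
{
    const int row = blockIdx.y * blockDim.y + threadIdx.y;
    const int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= Width || col >= Width) return;

    int Pvalue = 0;
    for (int k = 0; k < Width; k++)
    {
        Pvalue += Md[row * Width + k] * Nd[k * Width + col];
    }
    Pd[row * Width + col] = Pvalue;
}

// Hypothetical host wrapper: launches with a grid derived from Width and reports
// any launch or execution error instead of silently ignoring it.
cudaError_t launchMatrixMul(const int* Md, const int* Nd, int* Pd, int Width)
{
    dim3 dimBlock(4, 4, 1);
    dim3 dimGrid = gridFor(Width, dimBlock);
    MatrixMulKernelGuarded<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();
}
```

With Width = 17 and 4 x 4 blocks, gridFor yields a 5 x 5 grid. The fifth block in each dimension has only one row or column of in-range threads, and the guard makes the remaining threads return without touching memory.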
