Hi,

I'm trying to add the rows of a 4800x9600 matrix together, resulting in a 1x9600 matrix.

What I've done is split the 4800x9600 matrix into 9,600 columns of 4800 elements each. I then perform a reduction on each column's 4800 elements.

The trouble is, this is really slow...

Anyone got any suggestions?

Basically, I'm trying to implement MATLAB's sum(...) function.

Here is the code, which I've verified works fine; it's just really slow:

void reduceRows(Matrix Dresult, Matrix DA)
{
        //scratch buffer holding one column (DA.h elements) at a time
        Matrix Dchunk;
        Dchunk.h = 1; Dchunk.w = DA.h;
        cudaMalloc((void**)&Dchunk.data, Dchunk.h*Dchunk.w*sizeof(float));

        Matrix DcolSum;
        DcolSum.h = 1; DcolSum.w = 1;
        //DcolSum.data aliases the output vector, so no allocation is needed
        //cudaMalloc((void**)&DcolSum.data,DcolSum.h*DcolSum.w*sizeof(float));

        int i;
        for(i = 0; i < DA.w; i++)   //loop over each column
        {
                //printf("%d ",i);
                //copy column i into the scratch buffer (device-to-device)
                cudaMemcpy(Dchunk.data, &DA.data[i*DA.h], DA.h*sizeof(float), cudaMemcpyDeviceToDevice);
                //point the 1x1 result at element i of the output vector
                DcolSum.data = &Dresult.data[i];
                reduceTotal(DcolSum, Dchunk);
        }
        cudaFree(Dchunk.data);
}

Matrix is defined as:

typedef struct{
        long w;
        long h;
        float* data;
}Matrix;

reduceTotal() just calls the standard NVIDIA reduction; it sums all the elements in Dchunk and puts the answer in DcolSum.

I'm about to do all this on the CPU if I can't find an answer... ;(

Many thanks in advance,

A: 

Hi,

Reduction is a very basic operation in GPGPU; it's supposed to be fast, and 9600 reductions shouldn't be slow either.

What graphics card are you using?

I suggest you split it into 9600 arrays and, each time, reduce an array of 4800 elements into one result. Instead of reduceTotal, I suggest you use CUDPP to perform the reduction; CUDPP is like the STL for CUDA, and it's implemented with performance in mind.

http://code.google.com/p/cudpp/

shader
A: 

I think your problem is that you are launching 9600 x 2 kernels. This should be an easy algorithm to express as a single kernel.

The most naive way to implement it would not coalesce memory, but it could well be faster than the way you are doing it now.
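
For illustration, a minimal sketch of that naive single-kernel version, assuming the same column-major layout as your loop (&DA.data[i*DA.h] is the start of column i); the kernel name and launch configuration are made up:

// One thread per column; adjacent threads stride by h through memory,
// so the reads are uncoalesced, but there is only one kernel launch.
__global__ void sumColumnsNaive(const float* data, float* result, int h, int w)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= w) return;

    float sum = 0.0f;
    for (int row = 0; row < h; ++row)
        sum += data[col * h + row];

    result[col] = sum;
}

// launch, e.g.:
// sumColumnsNaive<<<((int)DA.w + 255) / 256, 256>>>(DA.data, Dresult.data, (int)DA.h, (int)DA.w);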

Once you've got the naive way working, coalesce your memory reads: e.g. have every thread in a block read 16 consecutive floats into shared memory, __syncthreads(), then accumulate the relevant 16 floats into a register, __syncthreads(), then repeat.
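
Here is a rough sketch of one coalesced variant (my own illustration, not necessarily the exact scheme described above; the kernel name, tile size, and launch shape are assumptions):

#define TILE 16

// A 16x16 block handles 16 columns. Threads with consecutive threadIdx.x
// read consecutive floats of the same column (column-major storage), so the
// loads coalesce; a small shared-memory reduction combines the partial sums.
__global__ void sumColumnsCoalesced(const float* data, float* result, int h, int w)
{
    __shared__ float partial[TILE][TILE];

    int col = blockIdx.x * TILE + threadIdx.y;   //column handled by this thread
    float sum = 0.0f;

    if (col < w)
    {
        //each thread strides down its column by TILE rows
        for (int row = threadIdx.x; row < h; row += TILE)
            sum += data[col * h + row];
    }

    partial[threadIdx.y][threadIdx.x] = sum;
    __syncthreads();

    //reduce the TILE partial sums of each column in shared memory
    for (int stride = TILE / 2; stride > 0; stride /= 2)
    {
        if (threadIdx.x < stride)
            partial[threadIdx.y][threadIdx.x] += partial[threadIdx.y][threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0 && col < w)
        result[col] = partial[threadIdx.y][0];
}

// launch, e.g.:
// dim3 block(TILE, TILE);
// dim3 grid(((int)DA.w + TILE - 1) / TILE);
// sumColumnsCoalesced<<<grid, block>>>(DA.data, Dresult.data, (int)DA.h, (int)DA.w);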

The GPU Computing SDK has lots of examples of reduction techniques.

Mark Borgerding
A: 

Instead of looping over each column, parallelize on the columns. Each of the 9600 threads sums the 4800 entries in its column, and puts the sum in the appropriate place in the result vector.

If you're looking for a library to make working with Cuda simpler, I highly recommend Thrust: http://code.google.com/p/thrust/

Using Thrust, I would create a functor to hold your matrix's pointer in device memory, and then map it over a sequence of column indices. The operator() of the functor would take an index, sum up everything in that column of the matrix, and return the sum. Then you would have your sums sitting in a thrust::device_vector without any memory copies (or even direct CUDA calls).

Your functor might look something like:

struct ColumnSumFunctor {
    const Matrix matrix;

    // Make a functor to sum the columns of the matrix
    ColumnSumFunctor(const Matrix& matrix);

    // Compute and return the sum of the specified column
    __device__
    float operator()(const int& column) const;
};
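
A rough usage sketch, assuming the constructor and operator() above are filled in; the colSums name and the call to thrust::transform over a counting_iterator are just one way to wire it up:

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>

// Map the functor over column indices 0..w-1; Thrust launches a single
// kernel and writes each column's sum straight into colSums on the device.
thrust::device_vector<float> colSums(DA.w);
thrust::transform(thrust::counting_iterator<int>(0),
                  thrust::counting_iterator<int>((int)DA.w),
                  colSums.begin(),
                  ColumnSumFunctor(DA));
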
interfect