Hi,

I'm trying to add the rows of a 4800x9600 matrix together, resulting in a 1x9600 matrix.

What I've done is split the 4800x9600 matrix into 9,600 columns of 4800 elements each. I then perform a reduction on each column's 4800 elements.

The trouble is, this is really slow...

Anyone got any suggestions?

Basically, I'm trying to implement MATLAB's sum(...) function.

Here is the code, which I've verified works fine; it's just really slow:

void reduceRows(Matrix Dresult, Matrix DA)
{
        //scratch buffer holding one column (DA.h elements) at a time
        Matrix Dchunk;
        Dchunk.h = 1; Dchunk.w = DA.h;
        cudaMalloc((void**)&Dchunk.data, Dchunk.h*Dchunk.w*sizeof(float));

        Matrix DcolSum;
        DcolSum.h = 1; DcolSum.w = 1;
        //DcolSum.data aliases the output vector, so no allocation is needed
        //cudaMalloc((void**)&DcolSum.data,DcolSum.h*DcolSum.w*sizeof(float));

        int i;
        for(i = 0; i < DA.w; i++)   //loop over each column
        {
                //printf("%d ",i);
                //copy column i into the scratch buffer (device-to-device)
                cudaMemcpy(Dchunk.data, &DA.data[i*DA.h], DA.h*sizeof(float), cudaMemcpyDeviceToDevice);
                //point the 1x1 result at element i of the output vector
                DcolSum.data = &Dresult.data[i];
                reduceTotal(DcolSum, Dchunk);
        }
        cudaFree(Dchunk.data);
}

Matrix is defined as:

typedef struct{
        long w;
        long h;
        float* data;
}Matrix;

reduceTotal() just calls the standard NVIDIA reduction; it sums all the elements in Dchunk and puts the answer in DcolSum.

I'm about to do all this on the CPU if I can't find an answer... ;(

Many thanks in advance,

A: 

Hi,

Reduction is a very basic operation in GPGPU; it's supposed to be fast, and 9600 reductions shouldn't be slow either.

What graphics card are you using?

I suggest you split it into 9600 arrays and, each time, reduce an array of 4800 elements into one result. Instead of reduceTotal, I suggest you use CUDPP to perform the reduction; CUDPP is like the STL for CUDA, and it's implemented with performance in mind.

http://code.google.com/p/cudpp/

shader
A: 

I think your problem is that you are launching 9600 x 2 kernels. This should be an easy algorithm to express as a single kernel.

The most naive way to implement it would not coalesce memory, but it could well be faster than the way you are doing it now.
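
For illustration, a minimal sketch of that naive single-kernel version, assuming the same column-major layout as your loop (&DA.data[i*DA.h] is the start of column i); the kernel name and launch configuration are made up:

// One thread per column; adjacent threads stride by h through memory,
// so the reads are uncoalesced, but there is only one kernel launch.
__global__ void sumColumnsNaive(const float* data, float* result, int h, int w)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= w) return;

    float sum = 0.0f;
    for (int row = 0; row < h; ++row)
        sum += data[col * h + row];

    result[col] = sum;
}

// launch, e.g.:
// sumColumnsNaive<<<((int)DA.w + 255) / 256, 256>>>(DA.data, Dresult.data, (int)DA.h, (int)DA.w);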

Once you've got the naive way working, coalesce your memory reads: e.g. have every thread in a block read 16 consecutive floats into shared memory, __syncthreads(), then accumulate the relevant 16 floats into a register, __syncthreads(), then repeat.
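
Here is a rough sketch of one coalesced variant (my own illustration, not necessarily the exact scheme described above; the kernel name, tile size, and launch shape are assumptions):

#define TILE 16

// A 16x16 block handles 16 columns. Threads with consecutive threadIdx.x
// read consecutive floats of the same column (column-major storage), so the
// loads coalesce; a small shared-memory reduction combines the partial sums.
__global__ void sumColumnsCoalesced(const float* data, float* result, int h, int w)
{
    __shared__ float partial[TILE][TILE];

    int col = blockIdx.x * TILE + threadIdx.y;   //column handled by this thread
    float sum = 0.0f;

    if (col < w)
    {
        //each thread strides down its column by TILE rows
        for (int row = threadIdx.x; row < h; row += TILE)
            sum += data[col * h + row];
    }

    partial[threadIdx.y][threadIdx.x] = sum;
    __syncthreads();

    //reduce the TILE partial sums of each column in shared memory
    for (int stride = TILE / 2; stride > 0; stride /= 2)
    {
        if (threadIdx.x < stride)
            partial[threadIdx.y][threadIdx.x] += partial[threadIdx.y][threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0 && col < w)
        result[col] = partial[threadIdx.y][0];
}

// launch, e.g.:
// dim3 block(TILE, TILE);
// dim3 grid(((int)DA.w + TILE - 1) / TILE);
// sumColumnsCoalesced<<<grid, block>>>(DA.data, Dresult.data, (int)DA.h, (int)DA.w);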

The GPU Computing SDK has lots of examples of reduction techniques.

Mark Borgerding
A: 

Instead of looping over each column, parallelize on the columns. Each of the 9600 threads sums the 4800 entries in its column, and puts the sum in the appropriate place in the result vector.

If you're looking for a library to make working with Cuda simpler, I highly recommend Thrust: http://code.google.com/p/thrust/

Using Thrust, I would create a functor to hold your matrix's pointer in device memory, and then map it over a sequence of column indices. The operator() of the functor would take an index, sum up everything in that column of the matrix, and return the sum. Then you would have your sums sitting in a thrust::device_vector without any memory copies (or even direct CUDA calls).

Your functor might look something like:

struct ColumnSumFunctor {
    const Matrix matrix;

    // Make a functor to sum the columns of the matrix
    ColumnSumFunctor(const Matrix& matrix);

    // Compute and return the sum of the specified column
    __device__
    float operator()(const int& column) const;
};
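
A rough usage sketch, assuming the constructor and operator() above are filled in; the colSums name and the call to thrust::transform over a counting_iterator are just one way to wire it up:

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>

// Map the functor over column indices 0..w-1; Thrust launches a single
// kernel and writes each column's sum straight into colSums on the device.
thrust::device_vector<float> colSums(DA.w);
thrust::transform(thrust::counting_iterator<int>(0),
                  thrust::counting_iterator<int>((int)DA.w),
                  colSums.begin(),
                  ColumnSumFunctor(DA));
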
interfect