views:

117

answers:

2

In the CUDA SDK, there is example code and presentation slides for an efficient one-dimensional reduction. I have also seen several papers on and implementations of one-dimensional reductions and prefix scans in CUDA.

Is there efficient CUDA code available for a reduction of a dense two-dimensional array? Pointers to code or pertinent papers would be appreciated.

+1  A: 

matrix reduction may be somewhat simpler to implement, because row/column reduction to a vector can be done independently. You can let each thread handle column/row (depending on matrix major dimension orientation) and coalesce memory reads. I doubt you can buy much performance over that without going to texture/constant cache where locality may become important

aaa
So you are suggesting that I use 1D thread blocks, first reduce one dimension of the array, and then reduce the resulting 1D array? One major efficiency problem is that both dimensions in a 2D array are likely smaller than a 1D array's length, e.g. 4096x4096 vs ~16M. With a scheme like the one you describe, you end up with fewer thread blocks, less work per thread, and much lower utilization of the GPU overall compared to the 1D reduction.
Bradford Larsen
@Bradford Larsen 64 threads thread block is optimal size. 4096 size problem would require 64 64xthread blocks, each thread reducing 4096 elements without synchronization. That should be enough to utilize gpu completely and give enough overlay to cover memory traffic.
aaa
Excuse me, but there will be synchronization problems, due to reduce operations between threads in blocks and blocks in grid. And if one will use such a technique with simple reducing operation (such as sum or min...) kernel couldn't utilize entire GPU and bandwidth. There must be more blocks with more smaller threads.
KoppeKTop
+1  A: 

I don't know what exactly the problem you try to solve, but actually you could simply think about 2D array as a long 1D array and use SDK code to reduce operation. Simple arrays in CUDA are just 1D memory blocks with special addressing rules - why wouldn't you take advantage of that opportunity.

KoppeKTop