The CUDA SDK includes example code and presentation slides for an efficient one-dimensional reduction. I have also seen several papers on, and implementations of, one-dimensional reductions and prefix scans in CUDA.
Is there efficient CUDA code available for reducing a dense two-dimensional array? Pointers to code or pertinent papers would be appreciated.