tags:

views: 213

answers: 2

I have a large array (say 512K elements), GPU resident, where only a small fraction of elements (say 5K randomly distributed elements - set S) needs to be processed. The algorithm to find out which elements belong to S is very efficient, so I can easily create an array A of pointers or indexes to elements from set S.

What is the most efficient way to run a CUDA or OpenCL kernel only over elements from S? Can I run a kernel over array A? All examples I've seen so far deal with contiguous 1D, 2D, or 3D arrays. Is there any problem with introducing one layer of indirection?

+1  A: 

In CUDA, contiguous (rather than random) memory access is preferred because it allows memory coalescing. It's not a big deal to create an array of randomly distributed indexes and process one index from A per thread, something like this:

__global__ void kernel_func(const unsigned * A, float * S, unsigned A_size)
{
    const unsigned idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= A_size)
        return;                    // guard: the last block may have extra threads

    const unsigned S_idx = A[idx]; // one layer of indirection

    S[S_idx] *= 5; // for example...
}

But the random accesses to S will be uncoalesced and therefore very slow (this is the most likely bottleneck).

If you decide to use CUDA, you should experiment a lot with block/grid sizes, minimize register consumption per thread (to maximize the number of blocks per multiprocessor), and perhaps sort A so that nearby threads use nearby values of S_idx...

KoppeKTop
+1  A: 

If you sort your indexes, or build the list already sorted, it will help performance a lot. If there are clusters of indexes, try using texture memory. And if you are accessing several elements from each thread with some overlap, I found that using shared memory gives a significant performance boost.

Eri