ansaurus

Question

Answer 1

+1 A:

In CUDA, contiguous (not random) memory access is preferred because it allows memory coalescing. Creating an array of randomly distributed indices and processing one index from A per thread is not difficult, something like this:

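__global__ void kernel_func(unsigned * A, float * S)
{
    // One thread per entry of the index array A.
    const unsigned idx = threadIdx.x + blockIdx.x * blockDim.x;
    const unsigned S_idx = A[idx]; // neighboring threads read neighboring entries of A

    S[S_idx] *= 5; // for example... (scattered access into S)
    ...
}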

But the access to S[S_idx] is effectively random and will be very slow; this is the most likely bottleneck.

If you decide to use CUDA, you will need to experiment a lot with block and grid sizes, minimize register consumption per thread (to maximize the number of blocks per multiprocessor), and maybe sort A so that neighboring threads use neighboring S_idx values.
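The sorting idea can be sketched roughly as follows. This is only an illustration under assumptions that are not in the original answer: the helper name scale_selected_sorted, the extra length parameter n, the block size of 256, and the use of Thrust for the sort are all my own choices, and the indices in A are assumed to be unique and smaller than the size of S.

```cuda
#include <cuda_runtime.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <vector>

// Same operation as the kernel above, with a bounds check on the index array.
__global__ void scale_selected_kernel(const unsigned *A, float *S, unsigned n)
{
    const unsigned idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        S[A[idx]] *= 5.0f;
}

// Hypothetical helper: sorts the indices on the device, then scales S[A[i]].
// Assumes every index in A is unique and smaller than S_host.size().
std::vector<float> scale_selected_sorted(const std::vector<float> &S_host,
                                         const std::vector<unsigned> &A_host)
{
    thrust::device_vector<float> S(S_host.begin(), S_host.end());
    thrust::device_vector<unsigned> A(A_host.begin(), A_host.end());

    // After sorting, neighboring threads touch neighboring (or at least
    // nearby) elements of S, which gives coalescing a better chance.
    thrust::sort(A.begin(), A.end());

    const unsigned n = static_cast<unsigned>(A.size());
    if (n > 0) {
        const unsigned threads = 256; // assumed starting point, tune per device
        const unsigned blocks = (n + threads - 1) / threads;
        scale_selected_kernel<<<blocks, threads>>>(
            thrust::raw_pointer_cast(A.data()),
            thrust::raw_pointer_cast(S.data()), n);
        cudaDeviceSynchronize();
    }

    // Copy the updated array back to the host.
    std::vector<float> result(S.size());
    thrust::copy(S.begin(), S.end(), result.begin());
    return result;
}
```

Calling scale_selected_sorted with S = {1, 2, 3, 4} and A = {2, 0} multiplies only elements 0 and 2 by five. Whether the sort pays for itself depends on how often the same index list is reused, so it is worth timing both variants.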

KoppeKTop 2010-08-15 08:05:10

Answer 2

+1 A:

If you sort your indices, or build the list already sorted, that will help performance a lot. If there are clusters of indices, try using texture memory. And if each thread accesses a number of elements, with some overlap between threads, I found that using shared memory gives a significant performance boost.
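A minimal sketch of the shared memory case, assuming an access pattern the answer does not spell out: each thread reads its own element plus one neighbor on each side, so adjacent threads overlap. The names window_sum_kernel, window_sum, BLOCK and RADIUS are hypothetical, and the kernel must be launched with exactly BLOCK threads per block.

```cuda
#include <cuda_runtime.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <vector>

constexpr int RADIUS = 1;  // neighbors read on each side (assumption)
constexpr int BLOCK = 256; // threads per block; the kernel relies on this

// Each block stages its tile plus a halo in shared memory once, then every
// thread sums its window from shared memory instead of global memory.
__global__ void window_sum_kernel(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK + 2 * RADIUS];
    const int gid = blockIdx.x * blockDim.x + threadIdx.x;
    const int lid = threadIdx.x + RADIUS;

    // Own element; positions past the end of the array count as zero.
    tile[lid] = (gid < n) ? in[gid] : 0.0f;

    // The first RADIUS threads also load the left and right halo cells.
    if (threadIdx.x < RADIUS) {
        const int left = gid - RADIUS;
        const int right = gid + BLOCK;
        tile[lid - RADIUS] = (left >= 0 && left < n) ? in[left] : 0.0f;
        tile[lid + BLOCK] = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();

    if (gid < n) {
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            sum += tile[lid + k];
        out[gid] = sum;
    }
}

// Hypothetical host wrapper: copies the data in, runs the kernel, copies back.
std::vector<float> window_sum(const std::vector<float> &in_host)
{
    const int n = static_cast<int>(in_host.size());
    std::vector<float> out_host(n);
    if (n == 0)
        return out_host;

    thrust::device_vector<float> in(in_host.begin(), in_host.end());
    thrust::device_vector<float> out(n);

    const int blocks = (n + BLOCK - 1) / BLOCK;
    window_sum_kernel<<<blocks, BLOCK>>>(thrust::raw_pointer_cast(in.data()),
                                         thrust::raw_pointer_cast(out.data()), n);
    cudaDeviceSynchronize();

    thrust::copy(out.begin(), out.end(), out_host.begin());
    return out_host;
}
```

With this staging, each input element is fetched from global memory about once per block rather than once per thread that needs it. The gain grows with the amount of overlap, so for a window this small it may be modest.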

Eri 2010-08-15 15:55:34

Sparse array in CUDA or OpenCL
