I have a large array (say 512K elements), GPU resident, where only a small fraction of elements (say 5K randomly distributed elements - set S) needs to be processed. The algorithm to find out which elements belong to S is very efficient, so I can easily create an array A of pointers or indexes to elements from set S.
What is the most efficient way to run a CUDA or OpenCL kernel only over elements from S? Can I run a kernel over array A? All examples I've seen so far deal with contiguous 1D, 2D, or 3D arrays. Is there any problem with introducing one layer of indirection?