I'm trying to think of a way to implement the following algorithm using CUDA:
Working on a large volume of voxels, I calculate an index `i` and a value `c` for each voxel. After the calculation I need to perform `histogram[i] += c`. `c` is a float value, and the histogram can have up to 15,000 bins.
I'm looking for a way to implement this efficiently using CUDA. The first obvious problem is that with compute capability 1.3, which is what I'm using, I can't even do an `atomicAdd()` of floats, so how can I accumulate anything reliably?
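One workaround I'm aware of, shown here only as a rough sketch, is to emulate the float add with a compare-and-swap loop. This assumes that 32-bit `atomicCAS()` on global memory is available on my device (as far as I know it is documented for compute capability 1.1 and up). The names `atomicAddFloat`, `accumulate`, `runDemo`, and the `idx`/`c` input arrays are hypothetical stand-ins for my real per-voxel computation, not part of the CUDA API. I have not measured how it performs when many voxels hit the same bin.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA built-in): emulates a float atomic add by
// retrying a 32-bit compare-and-swap on the bit pattern of the stored value.
__device__ float atomicAddFloat(float *address, float val)
{
    int *addressAsInt = (int *)address;
    int old = *addressAsInt;
    int assumed;
    do {
        assumed = old;
        // Try to swap in (current value + val); fails if another thread got in first.
        old = atomicCAS(addressAsInt, assumed,
                        __float_as_int(val + __int_as_float(assumed)));
    } while (assumed != old);  // integer comparison, so a NaN cannot loop forever
    return __int_as_float(old);
}

// Hypothetical kernel: idx[v] and c[v] stand in for the values computed per voxel.
__global__ void accumulate(const int *idx, const float *c, int n, float *histogram)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v < n)
        atomicAddFloat(&histogram[idx[v]], c[v]);
}

// Small host-side check: every voxel contributes 1.0f, so the histogram
// total should equal the number of voxels.
float runDemo()
{
    const int n = 1 << 16;
    const int bins = 15000;
    std::vector<int> hIdx(n);
    std::vector<float> hC(n, 1.0f);
    std::vector<float> hHist(bins);
    for (int v = 0; v < n; ++v)
        hIdx[v] = v % bins;

    int *dIdx;
    float *dC, *dHist;
    cudaMalloc((void **)&dIdx, n * sizeof(int));
    cudaMalloc((void **)&dC, n * sizeof(float));
    cudaMalloc((void **)&dHist, bins * sizeof(float));
    cudaMemcpy(dIdx, hIdx.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dHist, 0, bins * sizeof(float));  // all-zero bytes are 0.0f

    accumulate<<<(n + 255) / 256, 256>>>(dIdx, dC, n, dHist);
    cudaMemcpy(hHist.data(), dHist, bins * sizeof(float), cudaMemcpyDeviceToHost);

    float total = 0.0f;
    for (int b = 0; b < bins; ++b)
        total += hHist[b];

    cudaFree(dIdx);
    cudaFree(dC);
    cudaFree(dHist);
    return total;
}

int main()
{
    float total = runDemo();
    printf("histogram total = %f\n", total);
    return total == 65536.0f ? 0 : 1;
}
```

The loop compares the integer bit patterns rather than the float values, and it keeps the whole histogram in global memory, so it sidesteps the shared-memory size limit at the cost of contention on popular bins.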
This example by NVIDIA does something somewhat simpler: the histograms are kept in shared memory (which I can't do, because my histogram is too large to fit there) and it only accumulates integers. Can this approach be generalized to my case?