views:

170

answers:

2

I'm trying to think of a way to implement the following algorithm using CUDA:

Working on a large volume of voxels, for each voxel I calculate an index i and a value c. after the calculation I need to perform histogram[i] += c
c is a float value and the histogram can have up to 15,000 bins.

I'm looking for a way to implement this efficiently using CUDA. The first obvious problem is that with compute capabilities 1.3 which is what I'm using I can't even do an atomicAdd() of floats so how can I accumulate anything reliably?

This example by nVidia does something somewhat simpler. The histograms are saved in the shared memory (which I can't do due to its size) and it only accumulates integers. Can this approach be generalized to my case?

A: 

Histogramming directly to external memory using atomicAdd() is going to be very inefficient. Depending on the range of your histogram you might want to consider multiple passes through your source data, updating a partial histogram in shared memory, then write out the partial histogram after each pass. So long as you don't have to make a large number of passes through the input data this should be a lot more efficient. Obviously you would want to make the partial histogram as large as possible, subject to the limits of available shared memory, in order to minimise the number of passes required.

Paul R
+1  A: 

A two step approach will probably be the best. You can create multiple histograms for a subset of voxels and sum them all upfor a global histogram

Assuming you have N voxels, first create global device memory of size M x 15000 (where is M < N but big enough to keep all the cores busy)

Run a cuda kernal to compute the histogram of N/M number of voxels for each thread index. After all the threads are finished, you can now run another cuda kernal that sum the M histogram for your final histogram.

sjchoi
This looks like a promising approach but how do I make sure that no two threads write to the same cell at once?
shoosh
each thread would have it's own histogram that is summed using the 2nd kernel. Only thing you have to worry about is to make sure all the threads are complete before the summing takes place.
sjchoi