views: 81
answers: 3

I have an application where I need to take the average intensity of an image, for around 1 million images. It "feels" like a job for a GPU fragment shader, but fragment shaders are for per-pixel local computations, while image averaging is a global operation.

An image sum will suffice, since it differs from the average only by a constant factor. Is there a way to tell a fragment shader to add the current pixel value to some global accumulator variable, which the CPU can read back at the end of the shader program? It seems like adding to an accumulator should be safe in parallel, since addition is commutative, as long as each addition is atomic.
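To make that concrete, here is roughly the kind of accumulation I have in mind, sketched in CUDA rather than GLSL (I don't know of a fragment-shader equivalent); the image size, kernel, and integer accumulator are just illustrative assumptions:

    /* Illustrative only: every thread adds its pixel's intensity into one global
       accumulator with an atomic add. Assumes 8-bit grayscale pixels, so the sum
       of a megapixel image fits comfortably in an unsigned int. */
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void accumulate(const unsigned char* pixels, int n, unsigned int* sum)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(sum, (unsigned int)pixels[i]);   /* atomic, so safe in parallel */
    }

    int main()
    {
        const int n = 512 * 512;                       /* hypothetical image size */
        unsigned char* d_pixels;
        unsigned int*  d_sum;
        cudaMalloc(&d_pixels, n);
        cudaMalloc(&d_sum, sizeof(unsigned int));
        cudaMemset(d_pixels, 1, n);                    /* dummy image: every pixel = 1 */
        cudaMemset(d_sum, 0, sizeof(unsigned int));

        accumulate<<<(n + 255) / 256, 256>>>(d_pixels, n, d_sum);

        unsigned int sum = 0;
        cudaMemcpy(&sum, d_sum, sizeof(sum), cudaMemcpyDeviceToHost);
        printf("sum = %u, average = %f\n", sum, (double)sum / n);

        cudaFree(d_pixels);
        cudaFree(d_sum);
        return 0;
    }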

One approach I considered is loading the image into a texture, applying a 2x2 box blur, loading the result back into an N/2 x N/2 texture, and repeating until the output is 1x1. However, this would take log N applications of the shader, plus lots of copy operations to move data from the framebuffer into a texture.

Is there a way to do it in one pass? Or are there other shader tricks to do this that I haven't thought of? Or should I just break down and use CUDA? Not sure if it helps, but my images are sparse (90%+ of entries are zero).

+1  A: 

My gut tells me to attempt your implementation in OpenCL. You can optimize for your image size and graphics hardware by breaking up the images into bespoke chunks of data that are then summed in parallel. Could be very fast indeed.

Fragment shaders are great for convolutions, but there the result is written to gl_FragColor, so the model fits. Here you would ultimately have to loop over every pixel in the texture and sum the results, which are then read back by the main program. Generating image statistics is perhaps not what the fragment shader was designed for, and it's not clear that a major performance gain is to be had, since it's not guaranteed that a particular buffer is located in GPU memory.

It sounds like you may be applying this algorithm to a real-time motion detection scenario, or some other automated feature detection application. It may be faster to compute some statistics from a sample of pixels rather than the entire image and then build a machine learning classifier.

Best of luck to you in any case!

Pater Cuilus
Thanks for your response. I'll look into OpenCL.
redmoskito
+1  A: 

As it turns out, summing data in hardware isn't as straightforward as it seems. Since GPU code runs in parallel and a naive sum is inherently sequential, the optimal approach divides the work into as many independently summable chunks as possible.

If it must be done using the OpenGL API (as the original question required), the solution is to render to a texture, create a mipmap of the texture, and read back the 1x1 level. You have to set the filtering right (bilinear is appropriate, I think), but it should get close to the right answer, modulo precision error.
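A minimal sketch of that readback, with the function name and the assumption of an NxN single-channel float texture being mine rather than part of any API (it also assumes a GL 3.0 context with GLEW loaded):

    /* Hypothetical helper: generate the mip chain and read back the 1x1 level,
       which holds the average intensity of the texture. */
    #include <GL/glew.h>

    float texture_average(GLuint tex, int N)
    {
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER,
                        GL_LINEAR_MIPMAP_LINEAR);      /* the filtering noted above */
        glGenerateMipmap(GL_TEXTURE_2D);               /* builds the full mip chain */

        int top = 0;                                   /* 1x1 level index == log2(N) */
        while ((N >> top) > 1)
            ++top;

        float average = 0.0f;
        glGetTexImage(GL_TEXTURE_2D, top, GL_RED, GL_FLOAT, &average);
        return average;                                /* multiply by N*N for the sum */
    }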

However, I think CUDA or OpenCL is more appropriate for this task. The approach is called "reduction"; a nice writeup on it is available on the CUDA demos page. Basically the idea is what I mentioned in the question: take the sum of each element's 2x2 neighborhood and output the result to an N/2 x N/2 array, and repeat until the output is a 1x1 array.
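Here is a bare-bones sketch of one reduction pass and the host loop that drives it (CUDA; it assumes N is a power of two and leaves out all of the shared-memory tuning discussed in the SDK writeup):

    /* Illustrative only: each thread sums one 2x2 neighborhood of the n x n input
       and writes one element of the (n/2) x (n/2) output. */
    #include <cuda_runtime.h>

    __global__ void reduce2x2(const float* in, float* out, int n)  /* n = input width */
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;  /* output coordinates */
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        int half = n / 2;
        if (x < half && y < half) {
            const float* p = in + (2 * y) * n + (2 * x);
            out[y * half + x] = p[0] + p[1] + p[n] + p[n + 1];
        }
    }

    /* Host side: ping-pong between two device buffers until a 1x1 array remains.
       d_a initially holds the n x n image; d_b is scratch of at least (n/2)*(n/2). */
    float sum_image(float* d_a, float* d_b, int n)
    {
        while (n > 1) {
            dim3 block(16, 16);
            dim3 grid((n / 2 + block.x - 1) / block.x,
                      (n / 2 + block.y - 1) / block.y);
            reduce2x2<<<grid, block>>>(d_a, d_b, n);
            float* tmp = d_a; d_a = d_b; d_b = tmp;     /* output becomes next input */
            n /= 2;
        }
        float sum = 0.0f;
        cudaMemcpy(&sum, d_a, sizeof(float), cudaMemcpyDeviceToHost);
        return sum;
    }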

For an NxN array, this takes lg N iterations of the kernel, each requiring 4 memory accesses per output element, for a total of 4 * lg N sequential memory operations. You can use a larger neighborhood, but it is easy to verify that larger neighborhoods cost more total memory operations (a k x k neighborhood needs k^2 * log_k N of them) and give lower performance. As usual in GPGPU programming, the devil is in the details--to get the best performance you must take steps to maximize shared memory usage, maximize thread utilization, minimize shared memory bank conflicts, and so on. I highly recommend reading the PDF that accompanies the CUDA "reduction" example on the CUDA demos page.

redmoskito