I'm designing a CUDA app to process some video. The algorithm I'm using calls for filling in blank pixels in a way that's not unlike Conway's Game of Life: if the pixels around a blank pixel are all filled and all of similar values, that pixel gets filled in with the surrounding value. This iterates until the number of pixels left to fix is the same as it was in the previous iteration (i.e., when nothing more can be done).
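For illustration, a minimal sketch of a fill rule like the one described above, assuming an 8-bit grayscale image in which 0 marks a blank pixel; the `SIMILARITY` threshold, the choice of fill value, and the `fillRule` name are placeholders, not code from the question.

```c++
// Illustrative fill rule: a blank pixel is filled only when all eight
// neighbours are already filled and agree within a tolerance.
// Assumes 0 == "blank"; SIMILARITY is an arbitrary example threshold.
#define SIMILARITY 8

__device__ unsigned char fillRule(const unsigned char *img,
                                  int x, int y, int width, int height)
{
    int lo = 256, hi = -1;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            if (dx == 0 && dy == 0) continue;
            int nx = x + dx, ny = y + dy;
            if (nx < 0 || ny < 0 || nx >= width || ny >= height)
                continue;                 // treat out-of-bounds neighbours as "don't care"
            int v = img[ny * width + nx];
            if (v == 0) return 0;         // a blank neighbour: cannot fill yet
            lo = min(lo, v);
            hi = max(hi, v);
        }
    }
    if (hi >= 0 && hi - lo <= SIMILARITY)
        return (unsigned char)((lo + hi) / 2);  // fill with the surrounding value
    return 0;                                   // neighbours disagree: stay blank
}
```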

My quandary is this: the previous and next parts of the processing pipeline are both implemented in CUDA on the GPU. It would be expensive to transfer the entire image back to RAM, process it on the CPU, and then transfer it back to the GPU. Even if it's slower, I would like to implement the algorithm in CUDA.

However, the nature of this problem requires synchronization between all threads to update the global image between iterations. I thought about just launching the kernel repeatedly, once per iteration, but I cannot determine when the process is "done" unless I transfer data back to the CPU between iterations, which would introduce a large inefficiency because of the memory transfer latency over the PCI-e interface.

Does anyone with some experience with parallel algorithms have any suggestions? Thanks in advance.

+2  A: 

It sounds like you need an extra image buffer, so that you can keep the unmodified input image in one buffer and write the processed output image into the second buffer. That way each thread can process a single output pixel (or small block of output pixels) without worrying about synchronization etc.

Paul R
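A minimal sketch of the double-buffered pass this answer suggests, one thread per output pixel, reusing the illustrative `fillRule()` helper above; the kernel name and image layout are assumptions, not part of the original answer.

```c++
// One iteration: read only from src, write only to dst, so no
// synchronization between threads is needed within the pass.
__global__ void fillPass(const unsigned char *src, unsigned char *dst,
                         int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    unsigned char v = src[y * width + x];
    if (v == 0)                       // blank pixel: try to fill it from src
        v = fillRule(src, x, y, width, height);
    dst[y * width + x] = v;           // filled pixels are copied through unchanged
}
```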
Thanks for the answer Paul. I don't think I was too clear about the algorithm, but my problem stems from the fact that I need to iterate this process. The first iteration would fill in some pixels, then the second iteration should read the results from the first and fill in some more pixels that could not have been filled in the first iteration. The need for synchronization comes from my need to know the results of all the surrounding pixels before processing the pixel itself.
Xzhsh
OK - but can't you do this with two buffers and toggle backwards and forwards ? A -> B -> A -> B -> ... ? If not then perhaps we need to see more detail of the actual algorithm.
Paul R
It could be implemented like that, if only CUDA allowed some synchronization between thread blocks. Basically the 2nd step would depend on the 1st, and the 3rd would depend on the 2nd. I can only synchronize between threads inside a thread block (a maximum of 512 threads). Without many thread blocks, I'm not taking advantage of CUDA's parallelism, but with many thread blocks, I have no way of guaranteeing that each will finish before iterating except by calling the kernel many times. And if I want to call the kernel many times, I have to transfer data back to the CPU to check if it's done. :|
Xzhsh
If you have two kernels, one for A->B and one for B->A, then you can call each in turn and wait for it to complete (`cudaThreadSynchronize()`). That way you get synchronization. [Actually you probably only need one kernel and pass a flag to it to indicate direction (A->B or B->A).]
Paul R
Yup, but I have no idea if the algorithm is complete between each run of the kernel. It would only be done if the kernel says that there are no more pixels to process, but it takes an indeterminate number of steps. I would have to somehow transfer data back between each call of the kernel to check if it's done. If I have to repeat this many times, wouldn't the latency make this approach really expensive?
Xzhsh
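To make the cost of that per-iteration check concrete, here is a hedged sketch of a host loop that alternates the two buffers and copies back only a single 4-byte counter of still-blank pixels each iteration; the image itself never leaves the GPU. The counting kernel, the stopping rule, and all names are illustrative (it again assumes the `fillRule()` sketch above), and global-memory `atomicAdd()` needs compute capability 1.1 or later.

```c++
// Variant of the pass that also counts pixels still blank after it runs.
__global__ void fillPassCount(const unsigned char *src, unsigned char *dst,
                              int width, int height, unsigned int *remaining)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    unsigned char v = src[y * width + x];
    if (v == 0)
        v = fillRule(src, x, y, width, height);
    dst[y * width + x] = v;
    if (v == 0)
        atomicAdd(remaining, 1u);     // this pixel is still blank after the pass
}

// Host loop: only a 4-byte counter crosses the PCI-e bus per iteration.
// Returns the buffer holding the most recent output.
unsigned char *runFill(unsigned char *bufA, unsigned char *bufB,
                       int width, int height)
{
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);

    unsigned int *d_remaining, prev = 0xFFFFFFFFu, cur;
    cudaMalloc((void **)&d_remaining, sizeof(unsigned int));

    for (;;) {
        cudaMemset(d_remaining, 0, sizeof(unsigned int));
        fillPassCount<<<grid, block>>>(bufA, bufB, width, height, d_remaining);
        cudaMemcpy(&cur, d_remaining, sizeof(unsigned int),
                   cudaMemcpyDeviceToHost);     // also waits for the kernel
        if (cur == 0 || cur == prev)            // done, or no further progress
            break;
        prev = cur;
        unsigned char *tmp = bufA; bufA = bufB; bufB = tmp;  // swap A <-> B
    }
    cudaFree(d_remaining);
    return bufB;
}
```

Whether the latency of that small copy (plus the kernel launch overhead) per iteration is acceptable depends on how many iterations a frame needs, but it is far cheaper than transferring the whole image back and forth.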
OK - that was the part that I was missing - you don't know how many iterations you need to do. You may be able to do something with `threadfence` (assuming CUDA 2.2 or later) and/or atomics, but I haven't really thought this through properly.
Paul R
Ahhh, I just looked up `__threadfence()`; that seems like just the thing I need. I think I can probably just implement a while loop with a `__threadfence()` call at the end. It will probably be slow, so I'll see what I can do with shared memory and creative indexing, but thanks a bunch for the idea.
Xzhsh
Cool - glad we got to some kind of possible solution in the end ! Post a comment here when you've tried it and let me know how it works out.
Paul R
Hmm, upon further research, `__threadfence()` might still not be enough, as there is no way of guaranteeing each block will complete before rereading... I think I might just have to test an algorithm with only one thread block versus transferring the memory back and forth. It doesn't seem like inter-block synchronization is possible at all :\ Anyways, I'm going to sleep now, thanks for the ideas again.
Xzhsh
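For comparison, a hedged sketch of the single-thread-block variant mentioned here: the whole iteration loop runs inside one kernel launch, with `__syncthreads()` as the barrier between passes (legal because every cooperating thread is in the same block). It again assumes the illustrative `fillRule()` above, and all names are placeholders.

```c++
// Single-block version: one launch, iterate until no pixel changes.
// Needs two buffers; each thread strides over many pixels.
__global__ void fillSingleBlock(unsigned char *src, unsigned char *dst,
                                int width, int height)
{
    __shared__ int changed;
    const int n = width * height;

    for (;;) {
        if (threadIdx.x == 0) changed = 0;
        __syncthreads();

        for (int i = threadIdx.x; i < n; i += blockDim.x) {
            unsigned char v = src[i];
            if (v == 0) {
                v = fillRule(src, i % width, i / width, width, height);
                if (v != 0) changed = 1;        // benign race: many threads may write 1
            }
            dst[i] = v;
        }
        __syncthreads();                 // dst writes and 'changed' now visible to all

        if (changed == 0) break;         // nothing was filled this pass: done

        // Swap the roles of src and dst (local pointer copies, so every
        // thread performs the same swap).
        unsigned char *tmp = src; src = dst; dst = tmp;
        __syncthreads();                 // don't let thread 0 reset 'changed' too early
    }
    // On exit, dst holds the final image (identical to src at this point).
}
```

Launched as `fillSingleBlock<<<1, 512>>>(d_src, d_dst, width, height)`, this avoids both kernel relaunches and PCI-e round trips, but it exposes at most 512 threads of parallelism on a single multiprocessor, so it is mainly useful as the comparison point described above.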
OK - it might still be possible using atomics as semaphores but I just tied myself in knots trying to work out how you would actually implement this.
Paul R