tags:

views:

116

answers:

1

So I read a bit about CUDA and GPU programming. I noticed a few things, such as that access to global memory is slow (therefore shared memory should be used) and that the execution paths of threads in a warp should not diverge.

I also looked at the (dense) matrix multiplication example described in the programming guide, and at the n-body problem. The trick in both implementations seems to be the same: arrange the calculation in a grid (which it already is, in the case of the matrix multiplication); then subdivide the grid into smaller tiles; fetch the tiles into shared memory and let the threads calculate as long as possible, until they need to reload data from global memory into shared memory.
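To make the tiling pattern concrete, here is a minimal CPU sketch of it (plain C++, so it can run anywhere). In the actual CUDA kernel, the two tile buffers would be `__shared__` arrays, each thread block would compute one output tile, and `__syncthreads()` would separate the load phase from the compute phase; the function name `tiledMatMul` and the tile size `TILE` are illustrative choices, not from the guide.

```cpp
#include <vector>

const int TILE = 2;  // illustrative tile size; assumes n is a multiple of TILE

// CPU sketch of the shared-memory tiling pattern from the CUDA
// matrix-multiply example. A, B, C are n x n row-major matrices;
// C must be zero-initialized by the caller.
void tiledMatMul(const std::vector<float>& A, const std::vector<float>& B,
                 std::vector<float>& C, int n) {
    for (int bi = 0; bi < n; bi += TILE)
      for (int bj = 0; bj < n; bj += TILE)
        for (int bk = 0; bk < n; bk += TILE) {
            // "Fetch" one tile of A and one of B
            // (on the GPU: copy global memory -> shared memory)
            float tA[TILE][TILE], tB[TILE][TILE];
            for (int i = 0; i < TILE; ++i)
              for (int j = 0; j < TILE; ++j) {
                  tA[i][j] = A[(bi + i) * n + (bk + j)];
                  tB[i][j] = B[(bk + i) * n + (bj + j)];
              }
            // Calculate as long as possible on the fetched tiles
            for (int i = 0; i < TILE; ++i)
              for (int j = 0; j < TILE; ++j)
                for (int k = 0; k < TILE; ++k)
                    C[(bi + i) * n + (bj + j)] += tA[i][k] * tB[k][j];
        }
}
```

Each pair of tiles is loaded once and reused TILE times, which is exactly why the GPU version amortizes the slow global-memory accesses over many shared-memory reads.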

In the case of the n-body problem, the calculation for each body-body interaction is exactly the same (page 682):

bodyBodyInteraction(float4 bi, float4 bj, float3 ai)

It takes two bodies and an acceleration vector. A body vector has four components: its position and its mass. When reading the paper, the calculation is easy to understand.
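For reference, the interaction math can be written down in a few lines of plain C++ (the struct definitions stand in for CUDA's built-in `float4`/`float3` types, and `EPS2` is the usual softening factor; its value here is an arbitrary choice for illustration):

```cpp
#include <cmath>

struct float4 { float x, y, z, w; };  // stand-ins for CUDA's vector types
struct float3 { float x, y, z; };

const float EPS2 = 1e-9f;  // softening factor; value chosen arbitrarily here

// Body-body interaction: bi and bj hold position in .x/.y/.z and mass
// in .w; ai accumulates the acceleration of body i due to body j.
float3 bodyBodyInteraction(float4 bi, float4 bj, float3 ai) {
    float3 r = { bj.x - bi.x, bj.y - bi.y, bj.z - bi.z };
    float distSqr = r.x * r.x + r.y * r.y + r.z * r.z + EPS2;
    float invDist = 1.0f / std::sqrt(distSqr);
    float invDistCube = invDist * invDist * invDist;
    float s = bj.w * invDistCube;  // mass of j scales the pull
    ai.x += r.x * s;
    ai.y += r.y * s;
    ai.z += r.z * s;
    return ai;
}
```

Since every thread runs exactly this same straight-line code, there is no divergence, which is what makes the n-body kernel so GPU-friendly.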

But what if we have a more complex object, with a dynamic data structure? For now, just assume that we have an object (similar to the body object presented in the paper) which has a list of other objects attached, and that the number of attached objects is different in each thread. How could I implement that without causing the execution paths of the threads to diverge?

I'm also looking for literature that explains how different algorithms involving more complex data structures can be implemented effectively in CUDA.

+2  A: 

Your question is somewhat general, so here are a couple of general approaches:

  • pad your data so that all objects are the same size (for example, with zeros)

  • reorganize your data so that similar objects fall into the same warp; divergence only happens within a warp, not across warps

  • threads in a warp are always synchronized. If possible and necessary, partition your data into groups whose sizes are powers of 2 no greater than 32, to avoid explicit synchronization.

  • it's okay to have some divergence

  • use C++ templates, where it helps, to guide kernel flow.
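The first two suggestions can be sketched together. Assuming each object carries a variable-length list of neighbour indices (as in the question), one common preparation step on the host is: sort the objects by list length, so similar-sized objects land in the same warp, then flatten the lists into one dense array padded with a sentinel, so every thread runs the same loop count. The function name `padLists` and the `SENTINEL` value are illustrative choices:

```cpp
#include <algorithm>
#include <vector>

const int SENTINEL = -1;  // the kernel would skip sentinel entries

// Host-side preparation: sort variable-length lists by size (so a warp
// processes similar workloads), then flatten them into one dense array,
// padded to the maximum length with SENTINEL. Returns the flat array;
// row i occupies entries [i * maxLen, (i + 1) * maxLen).
std::vector<int> padLists(std::vector<std::vector<int>>& lists) {
    std::sort(lists.begin(), lists.end(),
              [](const std::vector<int>& a, const std::vector<int>& b) {
                  return a.size() < b.size();
              });
    size_t maxLen = 0;
    for (const auto& l : lists) maxLen = std::max(maxLen, l.size());
    std::vector<int> flat(lists.size() * maxLen, SENTINEL);
    for (size_t i = 0; i < lists.size(); ++i)
        std::copy(lists[i].begin(), lists[i].end(),
                  flat.begin() + i * maxLen);
    return flat;
}
```

The padding trades wasted memory and a few wasted iterations for uniform control flow; whether that trade is worth it depends on how skewed the list lengths are, which is why sorting by size first (so short lists share a warp) usually helps.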

aaa
Well, I guess just going to sciencedirect.com and typing in "cuda" is a good start. At least I've read some interesting papers since asking this question. Thanks for your answer.
Nils