So I read a bit about CUDA and GPU programming. I noticed a few things such that access to global memory is slow (therefore shared memory should be used) and that the execution path of threads in a warp should not diverge.
I also looked at the (dense) matrix multiplication example, described in the programmers manual and the nbody problem. And the trick with the implementation seems to be the same: Arrange the calculation in a grid (which it already is in case of the matrix mul); then subdivide the grid into smaller tiles; fetch the tiles into shared memory and let the threads calculate as long as possible, until it needs to reload data from the global memory into shared memory.
In case of the nbody problem the calculation for each body-body interaction is exactly the same (page 682):
bodyBodyInteraction(float4 bi, float4 bj, float3 ai)
It takes two bodies and an acceleration vectors. The body vector has four components it's position and the weight. When reading the paper, the calculation is understood easily.
But what is if we have a more complex object, with a dynamic data structure? For now just assume that we have an object (similar to the body object presented in the paper) which has a list of other objects attached and the number of objects attached is different in each thread. How could I implement that without having the execution paths of the threads to diverge?
I'm also looking for literature which explains how different algorithms involving more complex data structures can be effectively implemented in CUDA.