Hi all, I have a problem I am trying to tackle that involves a 7-point computational stencil. For those who may not know, this is a 3D grid where the 7 points are the n'th point and its neighbors one point away in the x, y and z directions, both positive and negative (that is, the neighbors to the east/west, north/south and up/down).

So these 6 points plus the 1 additional point I am working on are used in a calculation, and all are stored in a 1-dimensional array.

Assume nx is the width of the cube and ny is the height. In memory, when I access a point in the array All_points, such as All_points[n], then to get its neighbors in each direction I also need to access All_points[n-1], All_points[n+1], All_points[n-nx], All_points[n+nx], All_points[n-nx*ny], and All_points[n+nx*ny].

So my problem with this is that I am getting a ton of cache misses. I can't seem to find any code examples that demonstrate how to avoid this problem. Ideally I'd like to split this array back up into its x, y and z coordinates, such as All_x_points[], but then I run into a problem trying to keep that updated: All_points[n] changes, and when it does, the x, y or z value stored for some other All_points[n'] needs to be updated with it.
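
For concreteness, the access pattern described above can be sketched as follows. This is only an illustration: the function name `stencil_sum` and the plain sum are assumptions standing in for the real calculation, and `n` is assumed to be an interior point so that every neighbor index is in range.

```c
#include <stddef.h>

/* Hypothetical helper: combine the 7 stencil points around flat index n.
   Assumes n is an interior point, so all six neighbor indices are valid.
   The plain sum is a placeholder for the real calculation. */
double stencil_sum(const double *all_points, size_t n, size_t nx, size_t ny)
{
    size_t plane = nx * ny;  /* distance between neighbors in z */

    return all_points[n]
         + all_points[n - 1]     + all_points[n + 1]       /* x neighbors */
         + all_points[n - nx]    + all_points[n + nx]      /* y neighbors */
         + all_points[n - plane] + all_points[n + plane];  /* z neighbors */
}
```

The x neighbors sit next to `n` in memory, while the y and z neighbors are `nx` and `nx*ny` elements away, which is where the scattered accesses come from.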

Anyone seen this kind of thing done before?

A: 

7 points? Six defining a spatial coordinate, one defining a length? Are these... Stargate coordinates?

Why not turn your Array of Structures (AOS) into a Structure of Arrays (SOA)?

int point = points_all[i];      // index of the point you want
Vec2 nb_x = points_x[point];    // .x and .y are the neighbours left and right
Vec2 nb_y = points_y[point];    // .x and .y are the neighbours up and down
Vec2 nb_z = points_z[point];    // .x and .y are the neighbours front and back
knight666
That is what I would like to do - somehow store the neighbors separately, but if you'll see my comment above, the points have to somehow be updated every iteration too, because at some point points_all[i] would become an x neighbor in points_x[point] in a later iteration, correct?
Derek
Also, this code needs to be fast, and I am trying to get rid of pointer accesses as much as possible. The original code I started with does indeed have a structure that holds index values to the correct neighbor locations for each point, but that just transferred my cache misses to the lookup of the pointers instead of in the array lookup as part of the compute
Derek
Are you basically doing a 3D fluid simulation?
knight666
More or less. If you google "7 point computational stencil" there are a ton of papers out there, but no source code. I can't figure out how people are doing it with arrays instead of structures. Maybe they aren't. It seems like I should be able to hold the y-minus, y-plus, etc. points in separate arrays, update the appropriate index in those split-out arrays whenever the n'th point is updated, and combine them all at the end of each iteration, so that the kernel of my code is just a bunch of vector additions.
Derek
+1  A: 

What kind of access pattern is using your 7-point stencil? If you're having cache locality problems, this is the first question to ask -- if the access pattern of your central (x,y,z) coordinate is completely random, you may be out of luck.

If you have some control over the access pattern, you can try to adjust it to be more cache-friendly. If not, then you should consider what kind of access pattern to expect; you may be able to arrange the data so that this access pattern is more benign. A combination of these two can sometimes be very effective.

There is a particular data arrangement that is frequently useful for this kind of thing: bit-interleaved array layout. Assume (for simplicity) that the size of each coordinate is a power of two. Then, a "normal" layout will build the index by concatenating the bits for each coordinate. However, a bit-interleaved layout will allocate bits to each dimension in a round-robin fashion:

3D index coords: (xxxx, yyyy, zzzz)

normal index:    data[zzzzyyyyxxxx]  (x-coord has least-significant bits, then y)
bit-interleaved: data[zyxzyxzyxzyx]  (lsb are now relatively local)

Practically speaking, there is a minor cost: instead of multiplying the coordinates by their step values, you will need to use a lookup table to find your offsets. But since you will probably only need very short lookup tables (especially for a 3D array!), they should all fit nicely into cache.

3D coords:  (x,y,z)

normal index:      data[x + y*ystep + z*zstep]  where:
  ystep= xsize (possibly aligned-up, if not a power of 2?)
  zstep= ysize * ystep

bit-interleaved:   data[xtab[x] + ytab[y] + ztab[z]]  where:
  xtab={  0,  1,  8,  9, 64, 65, 72, 73,512...}   (x has bits 0,3,6,9...)
  ytab={  0,  2, 16, 18,128,130,144,146,1024...}  (y has bits 1,4,7,10...)
  ztab={  0,  4, 32, 36,256,260,288,292,2048...}  (z has bits 2,5,8,11...)
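
As a rough sketch, tables like the ones above could be generated once at start-up. The names `spread_bits3`, `build_interleave_tables` and `interleaved_index` are made up for this example, and it assumes each coordinate fits in 10 bits so the combined index fits in 32 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Spread the low 10 bits of v so that bit i lands at bit 3*i. */
static uint32_t spread_bits3(uint32_t v)
{
    uint32_t r = 0;
    for (unsigned i = 0; i < 10; ++i)
        r |= ((v >> i) & 1u) << (3 * i);
    return r;
}

/* Fill the three offset tables for coordinates 0..size-1 (size <= 1024).
   x takes bits 0,3,6,...; y takes bits 1,4,7,...; z takes bits 2,5,8,... */
void build_interleave_tables(uint32_t *xtab, uint32_t *ytab, uint32_t *ztab,
                             size_t size)
{
    for (size_t i = 0; i < size; ++i) {
        uint32_t s = spread_bits3((uint32_t)i);
        xtab[i] = s;
        ytab[i] = s << 1;
        ztab[i] = s << 2;
    }
}

/* Location of element (x,y,z) in the bit-interleaved array. */
size_t interleaved_index(const uint32_t *xtab, const uint32_t *ytab,
                         const uint32_t *ztab, size_t x, size_t y, size_t z)
{
    return (size_t)xtab[x] + ytab[y] + ztab[z];
}
```

Note that if a dimension is not a power of two, the index space still spans the next power of two in that dimension, so the data array would need to be padded accordingly (for example, a side of 21 would occupy a side of 32).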

Ultimately, whether this is any use depends entirely on the requirements of your algorithm. But, again, please note that if your algorithm is too demanding of your cache, you may want to look into adjusting the algorithm, instead of just the layout.

comingstorm
You are right-on with the normal indexing that you have listed there. That is exactly how my data is lined up. I am not sure of the possible values of xsize or ysize, though; possibly NOT a power of two. For the "small" test cases I am running, xsize seems to be 21. The data in the original array is accessed sequentially, not randomly. So when I update the n'th point, I can very well expect the n+1 point to use n and n+2 as its x neighbors, etc. In your example, what are those x/y/ztab values? The actual data values? I'm not quite getting how the interleaved data is accessed.
Derek
In my example, the data is assumed to be stored in a large 1-dimensional array. The x/y/ztab values are index offsets, used to calculate the location in data[] that the element is actually stored in. This is probably the fastest way to do bit-interleaving: rather than do manual bit-shifting every time you access an element, you precompute the index offsets for each coordinate, look them up (from very short tables that don't take much cache room), and add them together. (I will answer your other questions in subsequent comments...)
comingstorm
If your access pattern is strictly sequential, in the order that it is stored in memory, this is the best possible case. The 7-point stencil spreads the memory accesses out over 5 locations in the array -- but the data loaded from memory should be used efficiently, *if* your access pattern is in fact in memory-sequential order (i.e., your inner loop should be over your fastest-changing coordinate, the next-inner is over your next-fastest coordinate, etc.). If your data is larger than your cache, you will necessarily have *some* cache misses...
comingstorm
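A minimal sketch of the memory-sequential loop order described above, for the normal (non-interleaved) layout. The function name `sweep`, the separate `src`/`dst` arrays and the averaging are assumptions made for illustration; the original code may well update a single array in place.

```c
#include <stddef.h>

/* One pass over the interior points, reading src and writing dst.
   x is the fastest-changing coordinate in memory, so it is the inner loop.
   The average is a placeholder for the real stencil calculation. */
void sweep(const double *src, double *dst, size_t nx, size_t ny, size_t nz)
{
    size_t plane = nx * ny;

    for (size_t z = 1; z + 1 < nz; ++z)
        for (size_t y = 1; y + 1 < ny; ++y)
            for (size_t x = 1; x + 1 < nx; ++x) {
                size_t n = x + y * nx + z * plane;
                dst[n] = (src[n]
                        + src[n - 1]     + src[n + 1]
                        + src[n - nx]    + src[n + nx]
                        + src[n - plane] + src[n + plane]) / 7.0;
            }
}
```

With this ordering, each of the five streams (n-nx*ny, n-nx, the contiguous n-1..n+1, n+nx, n+nx*ny) advances one element at a time, so every cache line that is loaded gets fully used before it is evicted.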
There are a few reasonably simple things you can do to improve your cache performance. First, if your data elements are carrying extra data that is not used by your inner loop, you should separate it out and store it in a separate array, so you are not loading useless data along with the useful data (this is what @knight666 means by SOA).
comingstorm
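To illustrate the separation, here is a sketch with a made-up element type. The `metadata` field and all names are hypothetical, since the thread does not say what, if anything, each point carries besides its value.

```c
#include <stddef.h>

/* Array-of-structures element: every stencil read also pulls the unused
   metadata into cache. The metadata field is purely hypothetical. */
struct cell_aos {
    double value;        /* the only field the stencil needs */
    double metadata[7];  /* extra per-point data, unused by the inner loop */
};

/* Structure-of-arrays: the stencil values are packed contiguously, and the
   cold data lives in its own array that the inner loop never touches. */
struct grid_soa {
    double *value;     /* n stencil values */
    double *metadata;  /* 7 * n entries, same order as value */
    size_t  n;
};

/* Copy the hot field out of an AOS array into a packed array of n doubles. */
void split_hot_values(const struct cell_aos *cells, double *value, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        value[i] = cells[i].value;
}
```

In this made-up layout only one double in eight is useful to the stencil, so the packed array would fit roughly eight times as many useful values per cache line.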
Second, you could try prefetching your cache data. This is platform-specific and kind of tweaky (i.e., you will need to use compiler intrinsics to do the prefetch, then tune the prefetch lookahead for your specific inner loop), but it can keep your processor from stalling -- the prefetch ensures that the cache fill happens *before* the data is needed.
comingstorm
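As a sketch of what the prefetch could look like with GCC or Clang, which provide the `__builtin_prefetch` builtin (other compilers have their own intrinsics). The function name, the `PREFETCH_AHEAD` value and the averaging are assumptions; the lookahead in particular is a guess that would need tuning, and hardware prefetchers may already cover simple sequential streams.

```c
#include <stddef.h>

#define PREFETCH_AHEAD 16  /* lookahead in elements; a guess, needs tuning */

/* Update flat indices [first, last), all of which must be interior points.
   total is the number of elements in src. Only the farthest stream
   (n + nx*ny) is prefetched here, as an example. */
void sweep_range_prefetch(const double *src, double *dst,
                          size_t first, size_t last,
                          size_t nx, size_t ny, size_t total)
{
    size_t plane = nx * ny;

    for (size_t n = first; n < last; ++n) {
        size_t ahead = n + plane + PREFETCH_AHEAD;
        if (ahead < total)
            __builtin_prefetch(&src[ahead], 0, 3);  /* read, keep in cache */

        dst[n] = (src[n]
                + src[n - 1]     + src[n + 1]
                + src[n - nx]    + src[n + nx]
                + src[n - plane] + src[n + plane]) / 7.0;
    }
}
```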
Finally, you can try using the bit-interleaved data layout, combined with a corresponding bit-interleaved access pattern. The good news is that this will increase the reuse of your cache data. The bad news is that it will presumably change the actual quantitative results, as your elements will be updated in a different sequence! This might not be a bad thing -- for all I know, it may improve your convergence -- but it is certainly something to look out for...
comingstorm
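One way to get a bit-interleaved access pattern is to walk the interleaved index in increasing order and recover the coordinates from it. The sketch below does the decoding with plain bit loops; the names `compact_bits3` and `morton_decode` are made up, and coordinates are assumed to fit in 10 bits each.

```c
#include <stdint.h>

/* Gather every third bit of m, starting at bit `shift`, into a compact
   coordinate (the inverse of spreading one coordinate's bits). */
static uint32_t compact_bits3(uint32_t m, unsigned shift)
{
    uint32_t r = 0;
    for (unsigned i = 0; i < 10; ++i)
        r |= ((m >> (3 * i + shift)) & 1u) << i;
    return r;
}

/* Recover (x,y,z) from a bit-interleaved index:
   x uses bits 0,3,6,...; y uses bits 1,4,7,...; z uses bits 2,5,8,... */
void morton_decode(uint32_t m, uint32_t *x, uint32_t *y, uint32_t *z)
{
    *x = compact_bits3(m, 0);
    *y = compact_bits3(m, 1);
    *z = compact_bits3(m, 2);
}
```

A traversal would then loop m over the whole index range, decode it, and skip any (x,y,z) that falls on the boundary or in padding. The neighbor locations still have to be computed from the coordinates, since they are no longer at fixed distances from m.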
FYI I just got these answers, and it's the end of the day for me. Will respond when I get a chance.
Derek