ansaurus

Question

Sparse Multi-Dimensional Data Representation

Answer 1

+4 A:

How often do you add the tissue, and how much time can it take?

One simple solution is using a linked list+hash with pointers from one to the other.

Meaning:

Save a linked list containing all of the relevant points and their data
Save a hash to easily get to this data: key = coordinates, data = pointer to the linked list.

The implementation of the operations would be:
Add a box: Go over the full linked list, and take only the relevant elements into the "work" linked list
Iterate: Go over the "work" linked list
Find neighbors: Seek each of the neighbors in the hash.

Complexity:
Add: O(n), Iterate O(1) for finding next element, Neighbor O(1) average (due to hash).

Anna 2009-08-12 11:33:42

Hi Anna,This seems to be the simplest solution, and therefore the one I'll most likely implement!Looking again at the way the code accesses the current, dense, array, it seems that even the parts that iterate over the medium reference a point using an index function at every step. So, by just implementing your hash, I should be able to get by with just the hash. As you say, the hash will provide constant access time.Thanks,Ross

rossmcf 2009-08-13 09:53:54

Answer 2

+2 A:

If you want to use plain array indexing, you can create a sparse array on POSIX systems using mmap():

float (*a)[500][500];

a = mmap(0, (size_t)500 * sizeof a[0], PROT_READ | PROT_WRITE,
    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

if (a && (void *)a != MAP_FAILED)
{
    /* a is now 500 x 500 x 500 sparse array of floats */

You can then just access a[x][y][z] as you like, and it will only actually allocate real memory for each page that's touched. The array will be initialised to zero bytes.

If your system doesn't have MAP_ANONYMOUS you can achieve the same effect by mapping from /dev/zero.

Note that on many systems, swap space will be reserved (though not used) for the whole array.

caf 2009-08-12 14:04:19

Answer 3

+2 A:

First off, I think it's worth considering what your real requirement is. I suspect that it's not just "store the active points and none of the others in as space-efficient a manner as possible", but also a certain amount of "store adjacent points in nearby memory locations so that you get good caching behavior" and "store points in a manner for which lookups can be done efficiently".

With that said, here's what I would suggest. Divide the full 3D region into cubical blocks, all the same size. For each block, store all of the points in the block in dense arrays, including a boolean isTissue array for whether each point is in the tissue region or not. Allocate only the blocks that have points in them. Make a (dense) array of pointers to blocks, with NULL pointers for non-allocated blocks.

Thus, to find the point at (i,j), you first compute ii=i/blockside, jj=j/blocksize, and then look in the pointer-to-block table at (ii,jj) to find the block that contains your point. If that pointer is NULL, your point isn't in the tissue. If it's non-null, you look at (i mod blocksize, j mod blocksize) in that block, and there is your (i,j) point. You can check its isTissue flag to see if it's a "present" point or not.

You'll want to choose the block size as a balance between minimizing the number of times you do adjacent-point computations that cross block boundaries, and minimizing the number of points that are in blocks but not in the tissue region. I'd guess that at a minimum you want a row of the block to be a cache-line long. Probably the optimum is rather larger than that, though it will depend at least somewhat on your geometry.

To iterate over all the points in a 3D box, you would either just do lookups for each point, or (more efficiently) figure out which blocks the box touches, and iterate over the regions in those blocks that are within your box, skipping the ones where isTissue is false.

If you're doing a lot of deallocation and re-allocation of blocks, you probably want to "deallocate" blocks by dropping them into an "unused" pool, and then pulling blocks out of that pool rather than reallocating them. This also has the advantage that those blocks will already have all of their points set to "not present" (because that's why you deallocated the block), so you don't need to initialize them.

The experienced reader will probably recognize similarities between this and ways of laying out data for parallel computations; if you have a really big simulation, you can easily distribute the blocks across multiple nodes, and you only need to do cross-node communication for the cross-block computations. For that sort of application, you may find it useful to do nested levels of blocks, where you have meta-blocks (for the cross-node communication) containing smaller blocks (for the geometry).

Brooks Moses 2009-08-12 19:11:53

As an addendum, if your geometry is strongly anisotropic -- i.e., something made up of mostly-aligned strands or plates -- you might want non-cubical blocks.

Brooks Moses 2009-08-12 19:14:07

Hi Brooks,As it happens, the majority of access to the medium doesn't require neighbours. I suspect that an increase in the access time for those points will have little effect on the overall run time, so long as it remains constant. Thanks for your answer. It helped me to clarify the problem I'm trying to solve and made me think about it differently. Much appreciated.

rossmcf 2009-08-13 10:24:44

ansaurus

tags:

views:

answers:

Sparse Multi-Dimensional Data Representation

related questions