Large data sets representation in C/C++

+3 Q:

What is the best way to represent the following data for subsequent parallel computations:

A set of approximately 20,000,000 quadruples of integers that need to be accessible by the first three elements of each quadruple, used as indexes?

The computation is supposed to be done with MPI in C/C++.

UPD: I should also emphasize that I have two data structures of the kind described above. The only difference is that the first one is static and the second is dynamic. To be precise, the fourth element of each tuple in the second structure has to be computed.

Based on the comments, I'm now inclined to use C++ vectors and hash them by the first three values. I guess I need to create a hash map. How do I do that in C++?
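
One possible way to do this, as a minimal sketch rather than a definitive answer: pack the three index values into a single 64-bit key and store the fourth value in a `std::unordered_map` (C++11). The names `pack_key`, `QuadMap`, and `find_w` are made up for illustration, and the sketch assumes every index is non-negative and fits in 21 bits.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical helper: packs three indexes into one 64-bit key.
// Assumption: each index fits in 21 bits (values below 2,097,152).
inline std::uint64_t pack_key(std::uint32_t x, std::uint32_t y, std::uint32_t z) {
    const std::uint32_t limit = 1u << 21;
    assert(x < limit && y < limit && z < limit);
    return (static_cast<std::uint64_t>(x) << 42) |
           (static_cast<std::uint64_t>(y) << 21) |
           static_cast<std::uint64_t>(z);
}

// Maps the packed (x, y, z) key to the fourth element of the quadruple.
using QuadMap = std::unordered_map<std::uint64_t, int>;

// Returns a pointer to the stored fourth element, or nullptr if the key is absent.
inline const int* find_w(const QuadMap& table,
                         std::uint32_t x, std::uint32_t y, std::uint32_t z) {
    const auto it = table.find(pack_key(x, y, z));
    return it == table.end() ? nullptr : &it->second;
}
```

With this in place, `table[pack_key(x, y, z)] = w;` inserts or updates a quadruple, and calling `table.reserve(20000000)` before loading the data should avoid repeated rehashing.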

What kind of a system are you planning on running this on?

Can the whole thing fit into memory, or is there an io/caching issue that will need to be addressed?

How many bytes per integer?

At 32 bits per integer, you're looking at 20M * 4 * 4 = 320,000,000 bytes (about 305 MiB) of data, which easily fits into the RAM of a dedicated system and would probably fit on a multi-purpose PC as well.

In the best case, where the hardware allows it, fit the whole thing into one contiguous block of RAM. A vector of these quads can be radix-sorted in O(N) time, since the keys are fixed-width integers. From there, looking up a key in the sorted array (for example, by binary search) would be very fast.

fbrereto 2010-03-08 19:27:03
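
The sorted-array suggestion in the comment above could look roughly like this. It is only an illustration: to stay short it uses `std::sort` (O(N log N)) in place of a radix sort, and `Quad`, `key_less`, `sort_by_key`, and `find_quad` are hypothetical names.

```cpp
#include <algorithm>
#include <tuple>
#include <vector>

// One quadruple: {x, y, z} is the key, w is the associated value.
struct Quad {
    int x, y, z, w;
};

// Orders quads by their (x, y, z) key only; w does not take part.
inline bool key_less(const Quad& a, const Quad& b) {
    return std::tie(a.x, a.y, a.z) < std::tie(b.x, b.y, b.z);
}

// Sorts the whole array once, after loading.
inline void sort_by_key(std::vector<Quad>& quads) {
    std::sort(quads.begin(), quads.end(), key_less);
}

// Binary search in a key-sorted vector; returns nullptr if the key is absent.
inline const Quad* find_quad(const std::vector<Quad>& sorted, int x, int y, int z) {
    const Quad probe{x, y, z, 0};
    const auto it = std::lower_bound(sorted.begin(), sorted.end(), probe, key_less);
    if (it != sorted.end() && !key_less(probe, *it)) {
        return &*it;
    }
    return nullptr;
}
```

Because the keys never change, the fourth element can be updated in place without re-sorting, which may suit the "dynamic" structure from the question.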

@fbrereto: The initial datafile is ~500MB so it fits into memory.

Alex 2010-03-08 19:38:31

As I understand it, the commenters suggest hashing the first three values and using the result as the key in a hash map.

Gabriel Ščerbák 2010-03-08 19:28:44

You can also try B-trees or B*-trees and store the data in a file.

Gabriel Ščerbák 2010-03-08 19:31:12

Since the first structure is read-only and the second (it sounds like) is only accessed by one thread, you shouldn't have to worry about concurrency issues.

If you know that the three parts of the index fall within a "small" range of integer values, you can use a (possibly nested) vector with some unused memory and just use direct indexing. This has the advantage of being quite fast, but it won't work if the indexes can cover a large range of integer values.

Alternatively, if you have a wide range of key values, you can use a map, a hash map, or a sorted vector. A map would be easy to use but has per-node memory overhead. Similarly, a hash map offers great lookup time but again has memory overhead. A sorted vector would still offer O(log n) lookups without the per-node overhead of a map.

Mark B 2010-03-08 20:09:46
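
Of the options listed in the comment above, the ordered map is probably the shortest to write down. A hedged sketch using `std::map` keyed by a `std::tuple`; the aliases `Key` and `OrderedQuads` and the helper `values_with_x` are hypothetical. The helper shows one thing an ordered container gives you for free: a range scan over all entries that share the first index.

```cpp
#include <climits>
#include <map>
#include <tuple>
#include <vector>

using Key = std::tuple<int, int, int>;   // (x, y, z)
using OrderedQuads = std::map<Key, int>; // key -> fourth element

// Collects the fourth elements of all entries whose first index equals x.
// Tuples compare lexicographically, so those entries are adjacent in the map.
inline std::vector<int> values_with_x(const OrderedQuads& quads, int x) {
    std::vector<int> out;
    auto it = quads.lower_bound(Key(x, INT_MIN, INT_MIN));
    for (; it != quads.end() && std::get<0>(it->first) == x; ++it) {
        out.push_back(it->second);
    }
    return out;
}
```

Each `std::map` entry lives in its own heap-allocated node, which is the per-node overhead mentioned above; with 20,000,000 entries that overhead may well exceed the size of the data itself.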

@Mark B: The three parts of the index range over 100000..999999, 10..99, and 100..999, respectively.

Alex 2010-03-08 20:36:45
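
A quick feasibility check for direct indexing with these ranges, as a sketch: the box has 900,000 * 90 * 900 = 72,900,000,000 cells, so 20,000,000 quadruples fill roughly 0.03% of it, and a dense array of 4-byte values would need about 292 GB. The names `Range`, `extent`, `cell_count`, and `linear_index` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Inclusive integer range [lo, hi].
struct Range {
    std::int64_t lo;
    std::int64_t hi;
};

constexpr std::int64_t extent(Range r) { return r.hi - r.lo + 1; }

// Number of cells a dense 3D array over the three ranges would need.
constexpr std::int64_t cell_count(Range x, Range y, Range z) {
    return extent(x) * extent(y) * extent(z);
}

// Row-major position of (xi, yi, zi) inside the box spanned by the ranges.
inline std::int64_t linear_index(Range x, Range y, Range z,
                                 std::int64_t xi, std::int64_t yi, std::int64_t zi) {
    assert(x.lo <= xi && xi <= x.hi);
    assert(y.lo <= yi && yi <= y.hi);
    assert(z.lo <= zi && zi <= z.hi);
    return ((xi - x.lo) * extent(y) + (yi - y.lo)) * extent(z) + (zi - z.lo);
}

// The ranges given in the comment above.
constexpr Range kX{100000, 999999};
constexpr Range kY{10, 99};
constexpr Range kZ{100, 999};

static_assert(cell_count(kX, kY, kZ) == 72900000000LL,
              "900,000 * 90 * 900 cells");
```

Even though a dense array is out of reach here, the linear index stays below 2^37, so it could serve as a compact 64-bit key for a hash map or a sorted vector.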

+2 A:

This sounds like pointwise data in a 3D space, basically. There are many solutions to that problem, and the choice of best one depends on the range and distribution of your indices, and on your data access patterns. The latter is particularly important -- are you randomly selecting a set of values as your key and looking to see if there exists a data quad there, or are you accessing them in a more regular fashion? Different data structures offer very different costs for regular and random accesses.

For the sake of description, I'll call your data quads {X, Y, Z, W}, where {X, Y, Z} is your key, and W is the value associated with that key.

If you've got a rectangular range Xmin-to-Xmax, Ymin-to-Ymax, Zmin-to-Zmax, and this range is densely populated such that every X, Y, and Z in that range has a data quad associated with it, you simply use a 3D array indexed by X, Y, and Z, with a W stored at each point in that array.

If you've got something sort of like that except that only some of the values have data quads associated with them, but the fraction is reasonably large (say, 25% or more), then you can still use a 3D array, and at each point in that array you either store a W value or "nothing". If you need to be able to answer the question of whether an X, Y, Z triplet is in your data set, you either store an impossible W value (-1, perhaps, if they're otherwise positive integers, or INT_MAX if they're otherwise finite), or at each point you store a struct of a W integer and a boolean "is_present" flag, and set the flag to true/false for whether that index is present in your data set.
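
The array-with-flag variant might be sketched as below. This is an assumption-laden illustration, not a tuned implementation: `Cell` and `Grid3D` are hypothetical names, and the grid is stored as one flat `std::vector` in row-major order instead of a true three-dimensional array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One grid cell: the stored W plus a flag saying whether the key exists.
struct Cell {
    int w;
    bool is_present;
};

// Hypothetical dense grid over [xmin, xmax] x [ymin, ymax] x [zmin, zmax].
class Grid3D {
public:
    Grid3D(int xmin, int xmax, int ymin, int ymax, int zmin, int zmax)
        : xmin_(xmin), ymin_(ymin), zmin_(zmin),
          nx_(static_cast<std::size_t>(xmax - xmin + 1)),
          ny_(static_cast<std::size_t>(ymax - ymin + 1)),
          nz_(static_cast<std::size_t>(zmax - zmin + 1)),
          cells_(nx_ * ny_ * nz_, Cell{0, false}) {}

    void set(int x, int y, int z, int w) { cells_[index(x, y, z)] = Cell{w, true}; }

    bool contains(int x, int y, int z) const { return cells_[index(x, y, z)].is_present; }

    // Only valid for keys that are present.
    int get(int x, int y, int z) const {
        const Cell& c = cells_[index(x, y, z)];
        assert(c.is_present);
        return c.w;
    }

private:
    // Row-major offset; a key below its minimum wraps to a huge value and trips the assert.
    std::size_t index(int x, int y, int z) const {
        const std::size_t ix = static_cast<std::size_t>(x - xmin_);
        const std::size_t iy = static_cast<std::size_t>(y - ymin_);
        const std::size_t iz = static_cast<std::size_t>(z - zmin_);
        assert(ix < nx_ && iy < ny_ && iz < nz_);
        return (ix * ny_ + iy) * nz_ + iz;
    }

    int xmin_, ymin_, zmin_;
    std::size_t nx_, ny_, nz_;
    std::vector<Cell> cells_;
};
```

On typical platforms `Cell` is padded to 8 bytes, so the sentinel approach (a plain `std::vector<int>` with, say, -1 for "nothing") would roughly halve the memory at the cost of reserving one W value.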

If your data quads are more sparse than that but the indices still fall within a reasonable range, you can use a structure called an octree to represent the data set. Wikipedia has a writeup with diagrams here: http://en.wikipedia.org/wiki/Octree. Basically, you divide the range of possible indices up into 8 octants. If there are only a few data quads in that octant, you store a list of them; otherwise, you recursively divide that octant into 8 sub-octants, and repeat. Eventually you get this tree of octants and suboctants, and each leaf of the tree is a small list of data quads. Even though locating a single point in the tree is expensive (you have to traverse the tree down from the top), it's cheap to locate nearby neighbors, cheap to locate multiple points in the same space, and really cheap to iterate over all the points in the tree.
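
To make the subdivision concrete, here is a small octree sketch under stated assumptions: the index space is a cube whose side is a power of two, a leaf is split once it holds more than `kLeafCapacity` quads, and keys are unique (inserting an existing key overwrites W). `Quad`, `Octree`, and `kLeafCapacity` are hypothetical names, and neighbor queries are left out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

struct Quad { int x, y, z, w; };

// Assumed leaf capacity; a leaf holding more quads than this is subdivided.
constexpr std::size_t kLeafCapacity = 8;

// Hypothetical octree over the cube [ox, ox+size) x [oy, oy+size) x [oz, oz+size).
class Octree {
public:
    Octree(int ox, int oy, int oz, int size) : ox_(ox), oy_(oy), oz_(oz), size_(size) {
        assert(size > 0 && (size & (size - 1)) == 0);  // size must be a power of two
    }

    // Inserts a quad, or overwrites W if the key {x, y, z} is already stored.
    void insert(const Quad& q) {
        assert(covers(q.x, q.y, q.z));
        if (!is_leaf()) {
            children_[octant(q.x, q.y, q.z)]->insert(q);
            return;
        }
        for (Quad& p : points_) {
            if (p.x == q.x && p.y == q.y && p.z == q.z) { p.w = q.w; return; }
        }
        points_.push_back(q);
        if (points_.size() > kLeafCapacity && size_ > 1) split();
    }

    // Walks down from the root to the leaf that would hold the key.
    const Quad* find(int x, int y, int z) const {
        if (!covers(x, y, z)) return nullptr;
        const Octree* node = this;
        while (!node->is_leaf()) node = node->children_[node->octant(x, y, z)].get();
        for (const Quad& p : node->points_) {
            if (p.x == x && p.y == y && p.z == z) return &p;
        }
        return nullptr;
    }

    // Number of stored quads, counted by visiting every leaf.
    std::size_t count() const {
        if (is_leaf()) return points_.size();
        std::size_t n = 0;
        for (const auto& child : children_) n += child->count();
        return n;
    }

private:
    bool is_leaf() const { return !children_[0]; }

    bool covers(int x, int y, int z) const {
        return x >= ox_ && x < ox_ + size_ && y >= oy_ && y < oy_ + size_ &&
               z >= oz_ && z < oz_ + size_;
    }

    // Bit 0 selects the upper half in x, bit 1 in y, bit 2 in z.
    int octant(int x, int y, int z) const {
        const int half = size_ / 2;
        return (x >= ox_ + half ? 1 : 0) | (y >= oy_ + half ? 2 : 0) |
               (z >= oz_ + half ? 4 : 0);
    }

    // Creates the 8 sub-octants and redistributes this leaf's quads among them.
    void split() {
        const int half = size_ / 2;
        for (int i = 0; i < 8; ++i) {
            children_[i].reset(new Octree(ox_ + ((i & 1) ? half : 0),
                                          oy_ + ((i & 2) ? half : 0),
                                          oz_ + ((i & 4) ? half : 0), half));
        }
        std::vector<Quad> old;
        old.swap(points_);
        for (const Quad& q : old) children_[octant(q.x, q.y, q.z)]->insert(q);
    }

    int ox_, oy_, oz_, size_;
    std::array<std::unique_ptr<Octree>, 8> children_;
    std::vector<Quad> points_;
};
```

A tree is created with, for example, `Octree tree(0, 0, 0, 1 << 20);` and filled with `tree.insert(Quad{x, y, z, w});`. The cube side here is an arbitrary choice, and keys would first have to be shifted so they fall inside the cube.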

Brooks Moses 2010-03-08 22:08:25
