views:

373

answers:

4

What is the best data structure to store millions/billions of records (assume each record contains a name and an integer) in memory (RAM)? Best in terms of minimum search time (1st priority) and memory efficiency (2nd priority). Is it a Patricia tree? Is there anything better?

The search key is an integer (say, a 32-bit random integer), and all records are in RAM (assuming enough RAM is available).

In C, on Linux.

Basically, my server program assigns a 32-bit random key to each user, and I want to store the corresponding user record so that I can search for and delete records efficiently. It can be assumed that the data structure will be well populated.

+3  A: 

Depends.

Do you want to search on name or on integer?

Are the names all about the same size?

Are all the integers 32 bits, or some big number thingy?

Are you sure it all fits into memory? If not, then you're probably limited by disk I/O, and memory (or disk) usage is no longer a concern at all.

Do the indexes (names or integers) have common prefixes, or are they uniformly distributed? A Patricia tree is useful only if they have common prefixes.

Do you look up indexes in order (gang lookup) or randomly? If everything is uniform and random with no common prefixes, a hash table is already as good as it gets (which is bad).

If the index is the integer and gang lookup is used, you might look into radix trees.
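For illustration, here is a minimal sketch of a fixed-stride radix trie over 32-bit keys; the 4-bit stride, record layout, and function names are assumptions, not anything the answer prescribes:

    #include <stdlib.h>
    #include <stdint.h>

    /* Illustrative record type; the real layout depends on the application. */
    struct record {
        char     name[32];
        uint32_t key;
    };

    /* Fixed-stride radix trie: 4 bits per level, so a 32-bit key
     * is resolved in at most 8 node traversals. */
    #define STRIDE  4
    #define FANOUT  (1 << STRIDE)   /* 16 children per node */
    #define LEVELS  (32 / STRIDE)   /* 8 levels deep */

    struct radix_node {
        void *child[FANOUT];  /* interior: radix_node*; last level: record* */
    };

    /* Extract the 4-bit digit for a given level (most significant first). */
    static unsigned digit(uint32_t key, int level)
    {
        return (key >> (32 - STRIDE * (level + 1))) & (FANOUT - 1);
    }

    /* Insert a record, allocating interior nodes on demand.
     * Returns 0 on success, -1 on allocation failure. */
    int radix_insert(struct radix_node *root, struct record *rec)
    {
        struct radix_node *n = root;
        for (int level = 0; level < LEVELS - 1; level++) {
            unsigned d = digit(rec->key, level);
            if (!n->child[d]) {
                n->child[d] = calloc(1, sizeof(struct radix_node));
                if (!n->child[d])
                    return -1;
            }
            n = n->child[d];
        }
        n->child[digit(rec->key, LEVELS - 1)] = rec;
        return 0;
    }

    /* Lookup: follows at most 8 pointers, with no key comparisons. */
    struct record *radix_lookup(struct radix_node *root, uint32_t key)
    {
        struct radix_node *n = root;
        for (int level = 0; level < LEVELS - 1; level++) {
            n = n->child[digit(key, level)];
            if (!n)
                return NULL;
        }
        return n->child[digit(key, LEVELS - 1)];
    }

With random keys a lookup is a fixed eight pointer chases, and deletion just clears the leaf slot. The cost is sparsely filled interior nodes, which matters less once the structure is well populated, as the question assumes.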

Rutger Nijlunsing
A lot of problems can fit in RAM. Yesterday I configured a Dell with 96 GB of RAM for less than 20K euros.
Stephan Eggermont
Is the data dynamic? What priority do you place on the speed of insertion/deletion?
William Pursell
+1 for using 'big number thingy'
seth
+1  A: 

My educated guess is a B-tree (but I could be wrong ...):

B-trees have substantial advantages over alternative implementations when node access times far exceed access times within nodes. This usually occurs when most nodes are in secondary storage such as hard drives. By maximizing the number of child nodes within each internal node, the height of the tree decreases, balancing occurs less often, and efficiency increases. Usually this value is set such that each node takes up a full disk block or an analogous size in secondary storage. While 2-3 B-trees might be useful in main memory, and are certainly easier to explain, if the node sizes are tuned to the size of a disk block, the result might be a 257-513 B-tree (where the sizes are related to larger powers of 2).
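Since everything here is in RAM, the same node-sizing idea would be applied to cache lines rather than disk blocks. A hypothetical node layout under that assumption (the order and field names are illustrative, not from any particular implementation):

    #include <stdint.h>

    /* Hypothetical in-memory B-tree node: 16 x 4-byte keys fill one
     * 64-byte cache line (given 64-byte node alignment), so a search
     * within a node touches a single line before following a child. */
    #define BTREE_ORDER 16

    struct btree_node {
        uint32_t keys[BTREE_ORDER];                /* 64 bytes: one cache line */
        uint32_t nkeys;                            /* keys currently in use    */
        struct btree_node *child[BTREE_ORDER + 1]; /* NULL in leaf nodes       */
        void *value[BTREE_ORDER];                  /* record pointers          */
    };

    /* Linear scan within a node; at this size it is typically as fast
     * as binary search because the whole key array is one cache line. */
    static int node_find(const struct btree_node *n, uint32_t key)
    {
        int i = 0;
        while (i < (int)n->nkeys && n->keys[i] < key)
            i++;
        return i;  /* index of the first key >= the search key */
    }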

pageman
A: 

Instead of a hash, you can at least use a radix tree to get started.

For any specific problem you can do much better than a B-tree, a hash table, or a Patricia trie. Describe the problem a bit better, and we can suggest what might work.

Stephan Eggermont
A: 

If you just want retrieval by an integer key, then a simple hash table is fastest. If the integers are consecutive (or almost consecutive) and unique, then a simple array (of pointers to records) is even faster.

If using a hash table, you want to pre-allocate it for the expected final size so it doesn't have to rehash.
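A minimal sketch of such a pre-allocated table, using open addressing with linear probing; the 50% load factor, the multiplicative hash, and the record layout are illustrative assumptions:

    #include <stdlib.h>
    #include <stdint.h>

    struct record {
        char     name[32];
        uint32_t key;
    };

    struct hashtable {
        struct record **slot;  /* NULL = empty slot */
        size_t          mask;  /* capacity - 1; capacity is a power of two */
    };

    /* Multiplicative mix (Knuth's constant); random 32-bit keys barely
     * need it, but it keeps clustered keys from probing in lockstep. */
    static size_t slot_for(const struct hashtable *t, uint32_t key)
    {
        return (key * 2654435761u) & t->mask;
    }

    /* Pre-allocate for the expected final count at ~50% load,
     * so the table never has to rehash. */
    struct hashtable *ht_create(size_t expected)
    {
        size_t cap = 1;
        while (cap < expected * 2)
            cap <<= 1;

        struct hashtable *t = malloc(sizeof *t);
        if (!t)
            return NULL;
        t->slot = calloc(cap, sizeof *t->slot);
        if (!t->slot) {
            free(t);
            return NULL;
        }
        t->mask = cap - 1;
        return t;
    }

    /* Assumes fewer than `expected` insertions, so a free slot
     * always exists and the probe loop terminates. */
    int ht_insert(struct hashtable *t, struct record *rec)
    {
        for (size_t i = slot_for(t, rec->key); ; i = (i + 1) & t->mask) {
            if (!t->slot[i]) {
                t->slot[i] = rec;
                return 0;
            }
            if (t->slot[i]->key == rec->key)
                return -1;  /* duplicate key */
        }
    }

    struct record *ht_lookup(const struct hashtable *t, uint32_t key)
    {
        for (size_t i = slot_for(t, key); t->slot[i]; i = (i + 1) & t->mask) {
            if (t->slot[i]->key == key)
                return t->slot[i];
        }
        return NULL;
    }

One caveat for the question's delete requirement: deletion under linear probing needs tombstones or back-shifting; with separate chaining that complication goes away, at the cost of one extra pointer per record.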

Larry Watanabe
or try a cuckoo hash?
pageman
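For reference, cuckoo hashing bounds every lookup to at most two probes: each key can live in only one of two slots, one per table/hash function. A lookup-only sketch (the hash mixes and layout are illustrative assumptions; insertion, which evicts and relocates colliding entries, is omitted):

    #include <stdint.h>
    #include <stddef.h>

    struct record {
        char     name[32];
        uint32_t key;
    };

    struct cuckoo {
        struct record **t1, **t2;  /* two separate tables, NULL = empty */
        size_t mask;               /* capacity - 1; power-of-two tables */
    };

    /* Two independent mixes (illustrative constants). */
    static size_t h1(uint32_t k) { return k * 2654435761u; }
    static size_t h2(uint32_t k) { return (k ^ (k >> 16)) * 0x45d9f3bu; }

    /* Worst-case O(1): a key can only ever be in these two slots. */
    struct record *cuckoo_lookup(const struct cuckoo *c, uint32_t key)
    {
        struct record *r;

        r = c->t1[h1(key) & c->mask];
        if (r && r->key == key)
            return r;
        r = c->t2[h2(key) & c->mask];
        if (r && r->key == key)
            return r;
        return NULL;
    }

The trade-off is insertion: a colliding insert kicks the occupant to its alternate slot, possibly cascading, and a cycle forces a rehash.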