views:

433

answers:

6

I have a some large data structure (N > 10,000) that usually only needs to be created once (at runtime), and can be reused many times afterwards, but it needs to be loaded very quickly. (It is used for user input processing on iPhoneOS.) mmap-ing a file seems to be the best choice.

Are there any data structure libraries for C++ (or C)? Something along the line

ReadOnlyHashTable<char, int> table ("filename.hash");
// mmap(...) inside the c'tor
...
int freq = table.get('a');
...
// munmap(...); inside the d'tor.

Thank you!


Details:

I've written a similar class for hash table myself but I find it pretty hard to maintain, so I would like to see if there's existing solutions already. The library should

  • Contain a creation routine that serialize the data structure into file. This part doesn't need to be fast.
  • Contain a loading routine that mmap a file into read-only (or read-write) data structure that can be usable within O(1) steps of processing.
  • Use O(N) amount of disk/memory space with a small constant factor. (The device has serious memory constraint.)
  • Small time overhead to accessors. (i.e. the complexity isn't modified.)

Assumptions:

  • Bit representation of data (e.g. endianness, encoding of float, etc.) does not matter since it is only used locally.
  • So far the possible types of data I need are integers, strings, and struct's of them. Pointers do not appear.

P.S. Can Boost.intrusive help?

+2  A: 

Sounds like maybe you could use one of the "perfect hash" utilities out there. These spend some time opimising the hash function for the particular data, so there are no hash collisions and (for minimal perfect hash functions) so that there are no (or at least few) empty gaps in the hash table. Obviously, this is intended to be generated rarely but used frequently.

CMPH claims to cope with large numbers of keys. However, I have never used it.

There's a good chance it only generates the hash function, leaving you to use that to generate the data structure. That shouldn't be especially hard, but it possibly still leaves you where you are now - maintaining at least some of the code yourself.

Steve314
+2  A: 

You could try to create a memory mapped file and then create the STL map structure with a customer allocator. Your customer allocator then simply takes the beginning of the memory of the memory mapped file, and then increments its pointer according to the requested size. In the end all the allocated memory should be within the memory of the memory mapped file and should be reloadable later.

You will have to check if memory is free'd by the STL map. If it is, your customer allocator will lose some memory of the memory mapped file but if this is limited you can probably live with it.

Patrick
The issue with STL data structures in memory mapped files is pointers - you'd have to be sure you can always map the file to the exact same address range. I don't know if this is possible or not (or even if the iPhone *supports* memory-mapped files) - just something to check. If you can scan all the nodes and find their pointers, of course, you could do some kind of fix - but that kind of messing with private internals is very very fragile.
Steve314
@Steve: iPhoneOS do support `mmap(3)` the API call, and I think the kernel, which is almost the same as Mac OS X's, genuinely support it too.
KennyTM
You're right Steve, you have to be sure that the memory mapped file can be loaded at exactly the same memory address as where you created the file.Luckily in Windows you can choose the address using the function MapViewOfFileEx. For iPhone I don't know whether this is possible
Patrick
mmap allows you to specify the address, but there isn't any warranty that you will get it. You *should* get it as long as it is the first thing you do after startup.
jbcreix
You could replace pointers by a template pointer class that keeps the load address of the mmap() as a static member, and the offset relative to the mmap area as a member.
Carsten Kuckuk
A: 

Just thought of another option - Datadraw. Again, I haven't used this, so no guarantees, but it does claim to be a fast persistent database code generator.

Steve314
Sounds good, but I need to create the files during runtime. (Updated question to clarify.)
KennyTM
That may rule out my perfect hash answer, but not necessarily datadraw. "Persistent" doesn't mean "precompiled" - it just means there's built-in support for updating/reloading a file.
Steve314
Actually, I think I see the confusion - the code Datadraw generates *doesn't* have your data embedded into it. It just generates code to handle (mutable) data structures.
Steve314
A: 

WRT boost.intrusive, I've just been having a look. It's interesting. And annoying, as it makes one of my own libraries look a bit pointless.

I thought this section looked particularly relevant.

If you can use "smart pointers" for links, presumably the smart pointer type can be implemented using a simple offset-from-base-address integer (and I think that's the point of the example). An array subscript might be equally valid.

There's certainly unordered set/multiset support (C++ code for hash tables).

Steve314
A: 

Boost intrusive containers and offset pointers look great. Has anybody experienced this?

A: 

Using cmph would work. It does have the serialization machinery for the hash function itself, but you still need to serialize the keys and the data, besides adding a layer of collision resolution on top of it if your query set universe is not known before hand. If you know all keys before hand, then it is the way to go since you don't need to store the keys and will save a lot of space. If not, for such a small set, I would say it is overkill.

Probably the best option is to use google's sparse_hash_map. It has very low overhead and also has the serialization hooks that you need.

http://google-sparsehash.googlecode.com/svn/trunk/doc/sparse_hash_map.html#io

Davi