views:

1378

answers:

7

How can I store a hash table with separate chaining in a file on disk?

Generating the data stored in the hash table at runtime is expensive, it would be faster to just load the HT from disk...if only I can figure out how to do it.

Edit: The lookups are done with the HT loaded in memory. I need to find a way to store the hashtable (in memory) to a file in some binary format. So that next time when the program runs it can just load the HT off disk into RAM.

I am using C++.

+5  A: 

What language are you using? The common method is to do some sort binary serialization.

Ok, I see you have edited to add the language. For C++ there a few options. I believe the Boost serialization mechanism is pretty good. In addition, the page for Boost's serialization library also describes alternatives. Here is the link:

http://www.boost.org/doc/libs/1_37_0/libs/serialization/doc/index.html

BobbyShaftoe
> What language are you using? I am using C++. > The common method is to do some sort binary serialization.Can you please elaborate on this? I don't know about binary serialization.I would like to add that this also a learning exercise for me so I would like to do it all by hand. :)
Girish
+3  A: 

Assuming C/C++: Use array indexes and fixed size structs instead of pointers and variable length allocations. You should be able to directly write() the data structures to file for later read()ing.

For anything higher-level: A lot of higher language APIs have serialization facilities. Java and Qt/C++ both have methods that sprint immediately to mind, so I know others do as well.

Ryan Graham
+2  A: 

Perhaps DBM could be of use to you.

Eli Bendersky
Have you checked the licensing on that? Either you make your app open source, or you pay Sleepycat's rather steep (for anyone but a very large corporation) licensing fee.
Ryan Graham
+4  A: 

You could just write the entire data structure directly to disk by using serialization (e.g. in Java). However, you might be forced to read the entire object back into memory in order to access its elements. If this is not practical, then you could consider using a random access file to store the elements of the hash table. Instead of using a pointer to represent the next element in the chain, you would just use the byte position in the file.

Zach Scrivena
Zach,<unrelated question>I am still new to Stack overflow, how can I respond to individual posts? Is it possible/desirable to make a post so that it appears along with other peoples posts(answer) or should I just respond to individual posts via comments ("add comment")?
Girish
@Girish: Welcome to the community =) If your response is brief and specific to the post, then just add a comment. Longer responses that may be useful to others should be in the original question (click "edit"). Never post a response or additional information as a new answer (because it's not one).
Zach Scrivena
Got it, thanks.By a random access file do you mean fseek()'ing the file? ... not sure.
Girish
@Girish: Yes fseek() is correct.
Zach Scrivena
+1  A: 

If your hash table implementation is any good, then just store the hash and each object's data - putting an object into the table shouldn't be expensive given the hash, and not serialising the table or chain directly lets you vary the exact implementation between save and load.

Pete Kirkham
+2  A: 

Ditch the pointers for indices.

This is a bit similar to constructing an on-disk DAWG, which I did a while back. What made that so very sweet was that it could be loaded directly with mmap instead reading the file. If the hash-space is manageable, say 216 or 224 entries, then I think I would do something like this:

  • Keep a list of free indices. (if the table is empty, each chain-index would point at the next index.)
  • When chaining is needed use the free space in the table.
  • If you need to put something in an index that's occupied by a squatter (overflow from elsewhere) :
    • record the index (let's call it N)
    • swap the new element and the squatter
    • put the squatter in a new free index, (F).
    • follow the chain on the squatter's hash index, to replace N with F.
  • If you completely run out of free indices, you probably need a bigger table, but you can cope a little longer by using mremap to create extra room after the table.

This should allow you to mmap and use the table directly, without modification. (scary fast if in the OS cache!) but you have to work with indices instead of pointers. It's pretty spooky to have megabytes available in syscall-round-trip-time, and still have it take up less than that in physical memory, because of paging.

Anders Eurenius
A: 

Try using STLPLUS, it have API for persistence.

lsalamon