views:

1378

answers:

+10 Q:

How to store a hash table in a file?

How can I store a hash table with separate chaining in a file on disk?

Generating the data stored in the hash table at runtime is expensive, it would be faster to just load the HT from disk...if only I can figure out how to do it.

Edit: The lookups are done with the HT loaded in memory. I need to find a way to store the hashtable (in memory) to a file in some binary format. So that next time when the program runs it can just load the HT off disk into RAM.

I am using C++.

+5 A:

What language are you using? The common method is to do some sort binary serialization.

Ok, I see you have edited to add the language. For C++ there a few options. I believe the Boost serialization mechanism is pretty good. In addition, the page for Boost's serialization library also describes alternatives. Here is the link:

http://www.boost.org/doc/libs/1_37_0/libs/serialization/doc/index.html

BobbyShaftoe 2009-02-07 18:57:07

> What language are you using? I am using C++. > The common method is to do some sort binary serialization.Can you please elaborate on this? I don't know about binary serialization.I would like to add that this also a learning exercise for me so I would like to do it all by hand. :)

Girish 2009-02-07 19:14:36

+3 A:

Assuming C/C++: Use array indexes and fixed size structs instead of pointers and variable length allocations. You should be able to directly write() the data structures to file for later read()ing.

For anything higher-level: A lot of higher language APIs have serialization facilities. Java and Qt/C++ both have methods that sprint immediately to mind, so I know others do as well.

Ryan Graham 2009-02-07 19:02:33

+2 A:

Perhaps DBM could be of use to you.

Eli Bendersky 2009-02-07 19:05:58

Have you checked the licensing on that? Either you make your app open source, or you pay Sleepycat's rather steep (for anyone but a very large corporation) licensing fee.

Ryan Graham 2009-02-09 14:00:31

+4 A:

You could just write the entire data structure directly to disk by using serialization (e.g. in Java). However, you might be forced to read the entire object back into memory in order to access its elements. If this is not practical, then you could consider using a random access file to store the elements of the hash table. Instead of using a pointer to represent the next element in the chain, you would just use the byte position in the file.

Zach Scrivena 2009-02-07 19:06:37

Zach,<unrelated question>I am still new to Stack overflow, how can I respond to individual posts? Is it possible/desirable to make a post so that it appears along with other peoples posts(answer) or should I just respond to individual posts via comments ("add comment")?

Girish 2009-02-07 19:20:23

@Girish: Welcome to the community =) If your response is brief and specific to the post, then just add a comment. Longer responses that may be useful to others should be in the original question (click "edit"). Never post a response or additional information as a new answer (because it's not one).

Zach Scrivena 2009-02-07 19:31:58

Got it, thanks.By a random access file do you mean fseek()'ing the file? ... not sure.

Girish 2009-02-07 19:41:50

@Girish: Yes fseek() is correct.

Zach Scrivena 2009-02-07 20:07:17

+1 A:

If your hash table implementation is any good, then just store the hash and each object's data - putting an object into the table shouldn't be expensive given the hash, and not serialising the table or chain directly lets you vary the exact implementation between save and load.

Pete Kirkham 2009-02-07 20:05:31

+2 A:

Ditch the pointers for indices.

This is a bit similar to constructing an on-disk DAWG, which I did a while back. What made that so very sweet was that it could be loaded directly with mmap instead reading the file. If the hash-space is manageable, say 2¹⁶ or 2²⁴ entries, then I think I would do something like this:

Keep a list of free indices. (if the table is empty, each chain-index would point at the next index.)
When chaining is needed use the free space in the table.
If you need to put something in an index that's occupied by a squatter (overflow from elsewhere) :
- record the index (let's call it N)
- swap the new element and the squatter
- put the squatter in a new free index, (F).
- follow the chain on the squatter's hash index, to replace N with F.
If you completely run out of free indices, you probably need a bigger table, but you can cope a little longer by using mremap to create extra room after the table.

This should allow you to mmap and use the table directly, without modification. (scary fast if in the OS cache!) but you have to work with indices instead of pointers. It's pretty spooky to have megabytes available in syscall-round-trip-time, and still have it take up less than that in physical memory, because of paging.

Anders Eurenius 2009-02-07 23:09:51

Try using STLPLUS, it have API for persistence.

lsalamon 2009-02-09 02:02:38

ansaurus

tags:

views:

answers:

How to store a hash table in a file?

Ditch the pointers for indices.

related questions