Ditch the pointers for indices.
This is a bit similar to constructing an on-disk DAWG, which I did a while back. What made that so very sweet was that it could be loaded directly with mmap instead reading the file. If the hash-space is manageable, say 216 or 224 entries, then I think I would do something like this:
- Keep a list of free indices. (if the table is empty, each chain-index would point at the next index.)
- When chaining is needed use the free space in the table.
- If you need to put something in an index that's occupied by a squatter (overflow from elsewhere) :
- record the index (let's call it N)
- swap the new element and the squatter
- put the squatter in a new free index, (F).
- follow the chain on the squatter's hash index, to replace N with F.
- If you completely run out of free indices, you probably need a bigger table, but you can cope a little longer by using mremap to create extra room after the table.
This should allow you to mmap and use the table directly, without modification. (scary fast if in the OS cache!) but you have to work with indices instead of pointers. It's pretty spooky to have megabytes available in syscall-round-trip-time, and still have it take up less than that in physical memory, because of paging.