I have sets of hashes (the first 64 bits of MD5, so they're essentially uniformly distributed) and I want to be able to check whether a new hash is in a set, and to add it to a set.
The sets aren't too big; the largest will have millions of elements. But there are hundreds of sets, so I cannot hold them all in memory.
Some ideas I've had so far:
- I tried just keeping it all in an SQLite table, but it becomes really slow once it can no longer fit everything in memory.
- Bloom filters sound like they would have too high an error rate. I don't mind a tiny error rate (a 64-bit hash already gives about one collision on a 4G-element set), but error rates like 1% are far too high (see the quick calculation after this list).
- Keep a sorted list of hashes with gaps in a file, and resize the file when I run out of gaps. The hashes are uniformly distributed, so even a very simple scheme like that should work (rough sketch after this list).
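
On the Bloom filter point, here is a rough back-of-the-envelope check, assuming the standard sizing formula for an optimally tuned filter (the specific error rates are just illustrative):

```python
import math

# bits per element ~= log2(1/error_rate) / ln(2) for an optimally sized Bloom filter
for error_rate in (0.01, 2 ** -20, 2 ** -32):
    bits = math.log2(1 / error_rate) / math.log(2)
    print(f"error rate {error_rate:.1e}: ~{bits:.0f} bits per element")
```

This prints roughly 10, 29, and 46 bits per element, so pushing the error rate anywhere near what 64-bit hashes already give costs almost as much space as just storing the hashes themselves.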
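
And here is a minimal sketch of the last idea, read as an interpolation-placed open-addressing table on disk: each 64-bit hash gets a "home" slot proportional to its value, gaps are zero-valued slots, and the file is rewritten larger when gaps run low. Everything here (the `DiskHashSet` name, the load factor, growth by doubling, reserving 0 as the empty marker, not persisting the element count) is my own assumption for illustration, not existing code:

```python
import os
import struct

SLOT = struct.Struct(">Q")  # one 64-bit hash per 8-byte slot; 0 is reserved as "empty"

class DiskHashSet:
    def __init__(self, path, initial_slots=1 << 16, max_load=0.7):
        self.path = path
        self.max_load = max_load
        if not os.path.exists(path):
            with open(path, "wb") as f:
                f.write(b"\x00" * 8 * initial_slots)
        self.f = open(path, "r+b")
        self.slots = os.path.getsize(path) // 8
        self.count = 0  # simplification: not persisted across reopens

    def _read(self, i):
        self.f.seek(8 * (i % self.slots))
        return SLOT.unpack(self.f.read(8))[0]

    def _write(self, i, h):
        self.f.seek(8 * (i % self.slots))
        self.f.write(SLOT.pack(h))

    def _home(self, h):
        # Hashes are uniform, so a hash's slot is roughly proportional to its value.
        return (h * self.slots) >> 64

    def __contains__(self, h):
        i = self._home(h)
        while True:
            v = self._read(i)
            if v == h:
                return True
            if v == 0:      # hit a gap: h is not in the set
                return False
            i += 1          # linear probe (wraps via the modulo in _read)

    def add(self, h):
        if h in self:
            return
        if self.count + 1 > self.max_load * self.slots:
            self._grow()
        i = self._home(h)
        while self._read(i) != 0:   # probe to the next gap
            i += 1
        self._write(i, h)
        self.count += 1

    def _grow(self):
        # Rewrite the file with twice as many slots when gaps run low.
        old = [self._read(i) for i in range(self.slots)]
        self.f.close()
        with open(self.path, "wb") as f:
            f.write(b"\x00" * 8 * self.slots * 2)
        self.f = open(self.path, "r+b")
        self.slots *= 2
        self.count = 0
        for h in old:
            if h != 0:
                self.add(h)
```

Usage would be something like `s = DiskHashSet("set_01.bin"); s.add(h); h in s`. Because each lookup lands near its home slot, a query should touch only one or two pages of the file, which is the behavior I'm hoping for.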
Am I missing something really obvious? Any hints on how to implement a good disk-based hash table?