I have a hash table that I want to store to disk. The list looks like this:

<16-byte key                   > <1-byte result>
a7b4903def8764941bac7485d97e4f76 04
b859de04f2f2ff76496879bda875aecf 03
etc...

There are 1-5 million entries. Currently I'm just storing them in one file, 17 bytes per entry times the number of entries. That file is tens of megabytes. My goal is to store them in a way that optimizes first for space on the disk and then for lookup time. Insertion time is unimportant.

What is the best way to do this? I'd like the file to be as small as possible. Multiple files would be okay, too. Patricia trie? Radix trie?

Whatever good suggestions I get, I'll be implementing and testing. I'll post the results here for all to see.

+2  A: 

You could just sort entries by key and do a binary search.

Fixed-size keys and data entries mean you can very quickly jump from row to row, and storing only the key and data means you're not wasting any space on metadata.

I don't think you'll do any better on disk space, and lookup times are O(log(n)). Insertion times are crazy long, but you said that didn't matter.
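
A minimal sketch of that on-disk lookup, assuming the file holds sorted, packed 17-byte records as in the question (names are illustrative and error handling is omitted):

#include <stdio.h>
#include <string.h>

#define REC_SIZE 17   /* 16-byte key + 1-byte result */

/* Binary search a file of sorted, fixed-size records.
   Returns 1 and fills *result on a hit, 0 otherwise. */
int lookup(FILE *f, const unsigned char key[16], unsigned char *result)
{
    unsigned char rec[REC_SIZE];
    long lo = 0, hi;

    fseek(f, 0, SEEK_END);
    hi = ftell(f) / REC_SIZE;          /* number of records */

    while (lo < hi) {
        long mid = lo + (hi - lo) / 2;
        fseek(f, mid * REC_SIZE, SEEK_SET);
        fread(rec, 1, REC_SIZE, f);
        int cmp = memcmp(key, rec, 16);
        if (cmp == 0) { *result = rec[16]; return 1; }
        if (cmp < 0)  hi = mid;
        else          lo = mid + 1;
    }
    return 0;
}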

If you're really willing to tolerate long access times, do sort the table, but then chunk it into blocks of some size and compress them. Store the offset* and start/end keys of each block in a section at the start of the file. Using this scheme, you can find the block containing the key you need in linear time and then perform a binary search within the decompressed block. Choose the block size based on how much of the file you're willing to load into memory at once.

Using an off-the-shelf compression scheme (like GZIP) you can tune the compression ratio as needed; a larger (less heavily compressed) file will presumably have quicker lookup times.

I have doubts that the space savings will be all that great, as your structure seems to be mostly hashes. If they are actually hashes, they're random and won't compress terribly well. Sorting will help increase the compression ratio, but not by a ton.

*Use the header to look up the offset of a block to decompress and use.
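
A sketch of the lookup side of this scheme, assuming zlib for the compression and a made-up block-descriptor layout for the header (first key, file offset, compressed and uncompressed sizes per block); for brevity the decompressed block is scanned linearly here rather than binary searched:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

#define REC_SIZE 17

/* Hypothetical block descriptor, loaded from the header at the start of the file. */
struct block {
    unsigned char first_key[16];   /* smallest key stored in the block */
    long offset;                   /* where the compressed block starts */
    unsigned long csize, usize;    /* compressed / uncompressed sizes */
};

int lookup(FILE *f, const struct block *blocks, int nblocks,
           const unsigned char key[16], unsigned char *result)
{
    /* scan the in-memory index to find the block whose range holds the key */
    int b = 0;
    while (b + 1 < nblocks && memcmp(key, blocks[b + 1].first_key, 16) >= 0)
        b++;

    unsigned char *cbuf = malloc(blocks[b].csize);
    unsigned char *ubuf = malloc(blocks[b].usize);
    fseek(f, blocks[b].offset, SEEK_SET);
    fread(cbuf, 1, blocks[b].csize, f);

    uLongf ulen = blocks[b].usize;
    uncompress(ubuf, &ulen, cbuf, blocks[b].csize);    /* zlib */

    int found = 0;
    for (unsigned long i = 0; i + REC_SIZE <= ulen; i += REC_SIZE) {
        if (memcmp(key, ubuf + i, 16) == 0) {
            *result = ubuf[i + 16];
            found = 1;
            break;
        }
    }
    free(cbuf);
    free(ubuf);
    return found;
}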

Kevin Montrose
How about this: First I store 256 numbers, each 4 bytes. Each one says how many keys start with a specific prefix. So if I have 10 keys that start with 0x00 and 20 that start with 0x01, the first 8 bytes of the file are 0x0000000a00000014. Then I store the keys sorted but without the first byte. Total storage: 256*4 bytes + N*16 bytes. Compared to N*17 bytes, I've already saved a few megs.
Eyal
If your hashes are actually hashes (and accordingly are basically random) you won't see much in the way of savings. Don't neglect the actual disk cost of having multiple files either. If your keys are not hashes, however, you could exploit that to compress the keys and save space.
Kevin Montrose
Just a single file of size 256*4+N*16 compared to N*17. With N>1million, already it's a nice savings! Maybe even better can be done...
Eyal
If you're willing to do that, just go for full on compression. You'll get better ratios. Answer updated accordingly.
Kevin Montrose
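
For reference, a sketch of a lookup against the prefix-count layout Eyal describes above: a header of 256 four-byte counts, then the sorted entries with the first key byte stripped (15 remaining key bytes + 1 value byte each). The group is scanned linearly here; since the entries are sorted, a binary search would also work. Names are made up:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define ENTRY_SIZE  16          /* 15 remaining key bytes + 1 value byte */
#define HEADER_SIZE (256 * 4)

/* counts[] holds the 256 per-prefix entry counts read from the header. */
int lookup(FILE *f, const uint32_t counts[256],
           const unsigned char key[16], unsigned char *result)
{
    /* entries whose first byte is smaller than key[0] come earlier in the file */
    long start = 0;
    for (int p = 0; p < key[0]; p++)
        start += counts[p];

    unsigned char entry[ENTRY_SIZE];
    fseek(f, HEADER_SIZE + start * ENTRY_SIZE, SEEK_SET);
    for (uint32_t i = 0; i < counts[key[0]]; i++) {
        fread(entry, 1, ENTRY_SIZE, f);
        if (memcmp(key + 1, entry, 15) == 0) {   /* compare the 15-byte tail */
            *result = entry[15];
            return 1;
        }
    }
    return 0;
}
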
+1  A: 

Would the simple approach of storing them in a sqlite database work? I don't suppose it'll get any smaller, but you should get very good lookup performance, and it's very easy to implement.
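
A sketch of what the lookup might look like through the SQLite C API (the table and column names are made up):

#include <sqlite3.h>

/* Assumes a table created with something like:
   CREATE TABLE entries (key BLOB PRIMARY KEY, result INTEGER);
   Returns 1 and fills *result on a hit, 0 otherwise. */
int lookup(sqlite3 *db, const unsigned char key[16], unsigned char *result)
{
    sqlite3_stmt *stmt;
    int found = 0;

    sqlite3_prepare_v2(db, "SELECT result FROM entries WHERE key = ?",
                       -1, &stmt, NULL);
    sqlite3_bind_blob(stmt, 1, key, 16, SQLITE_STATIC);

    if (sqlite3_step(stmt) == SQLITE_ROW) {
        *result = (unsigned char)sqlite3_column_int(stmt, 0);
        found = 1;
    }
    sqlite3_finalize(stmt);
    return found;
}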

Nifle
+1  A: 

First of all, multiple files are not OK if you want to optimize for disk space, because of cluster size: when you create a file of ~100 bytes, disk space is still consumed in whole clusters, 2 kB for example.

Secondly, in your case I would store the whole table in a single binary file, ordered simply ascending by the byte values of the keys. That gives you a file whose length is exactly entriesNumber*17, which is minimal if you don't want to use compression, and you can search it very quickly, in ~log2(entriesNumber) time: to find a key, divide the file into two halves and compare the key at the boundary with the needed key. If the boundary key is bigger, take the first half; otherwise take the second half. Then divide the chosen half into two parts again, and so on. So you will need about log2(entriesNumber) read operations to find a single key.

+7  A: 

5 million records is about 81MB, which is acceptable for working with an array in memory.

As you describe the problem, these are more like unique keys than hash values. Try using a hash table to access the values (look at this link).

If I've misunderstood and these really are hashes, try building a second hash level on top of them.

A hash table can be successfully organized on disk too (e.g. as a separate file).

Addition

A solution with good search performance and little overhead (sketched in code after the lists below) is:

  1. Define a hash function which produces integer values from keys.
  2. Sort the records in the file according to the values produced by this function.
  3. Store the file offset at which each hash value starts.
  4. To locate a value:
    4.1. compute its hash with the function
    4.2. look up the offset in the file
    4.3. read records from the file starting at that position until the key is found, the offset of the next hash value is reached, or end-of-file.

There are some additional things which must be pointed out:

  • The hash function must be fast to be effective
  • The hash function must produce uniformly distributed values, or close to it
  • The table of hash value offsets can be placed in a separate file
  • The table of hash value offsets can be built dynamically by reading the whole sorted file sequentially at application start and keeping it in memory
  • At step 4.3, records should be read in blocks, not one by one, to be efficient. Ideally, read all values with the computed hash into memory at once.

You can find some examples of hash functions here.
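
A sketch of the lookup (steps 4.1-4.3), assuming the "hash" is simply the first two key bytes (fast and, for random keys, roughly uniform) and that the offset table has already been loaded into memory with one sentinel entry pointing at end-of-file; the matching range is read record by record here purely for brevity:

#include <stdio.h>
#include <string.h>

#define REC_SIZE 17
#define NBUCKETS (1 << 16)

/* offsets[] has NBUCKETS+1 entries; offsets[h] is the file position of the
   first record whose hash is h, and offsets[NBUCKETS] is the file size. */
int lookup(FILE *f, const long offsets[NBUCKETS + 1],
           const unsigned char key[16], unsigned char *result)
{
    unsigned h = ((unsigned)key[0] << 8) | key[1];     /* step 4.1 */
    long pos = offsets[h], end = offsets[h + 1];       /* step 4.2 */

    unsigned char rec[REC_SIZE];                       /* step 4.3 */
    fseek(f, pos, SEEK_SET);
    for (; pos < end; pos += REC_SIZE) {
        fread(rec, 1, REC_SIZE, f);
        if (memcmp(key, rec, 16) == 0) {
            *result = rec[16];
            return 1;
        }
    }
    return 0;
}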

ThinkJet
You're right, they are unique keys, not hashes of anything.
Eyal
+1: "acceptable to work with array in memory", with caveat, on desktop / server systems. If an embedded app not so much.
Binary Worrier
+1  A: 

Hi

As always with file design, the more you know (and tell us) about the distribution of data the better. On the assumption that your key values are evenly distributed across the set of all 16-byte keys -- which should be true if you are storing a hash table -- I suggest a combination of what others have already suggested:

-- binary data such as this belongs in a binary file; don't let the fact that the easy representation of your hashes and values is as strings of hexadecimal digits fool you into thinking that this is string data;

-- file size is such that the whole shebang can be kept in memory on any modern PC or server and a lot of other devices too;

-- the leading two bytes (four hex digits) of your keys divide the set of possible keys into 16^4 (= 65536) subsets; if your keys are evenly distributed and you have 5x10^6 entries, that's about 76 entries per subset; so create a file with space for, say, 100 entries per subset; then:

-- at offset 0 start writing all the entries whose leading two bytes are 0x0000; pad to the total of 100 entries (1700 bytes I think) with 0s;

-- at offset 1700 start writing all the entries whose leading two bytes are 0x0001, pad,

-- repeat until you've written all the data.

Now your lookup becomes a calculation to figure out the offset into the file followed by a scan of up to 100 entries to find the one you want. If this isn't fast enough then use 16^5 subsets, allowing about 6 entries per subset (6x16^5 = 6291456). I guess that this will be faster than binary search -- but it is only a guess.

Insertion is a bit of a problem; it's up to you, with your knowledge of your data, to decide whether new entries (a) necessitate the re-sorting of a subset or (b) can simply be added at the end of the list of entries at that index (which means scanning the entire subset on every lookup).

If space is very important you can, of course, drop the leading two bytes from your entries, since they are implied by the calculation of the offset into the file.

What I'm describing, not terribly well, is a hash table.
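
A sketch of that lookup, assuming full 17-byte entries are kept in each slot (i.e. the leading bytes are not dropped), empty slots are zero-padded, and the subset is chosen by the leading two key bytes:

#include <stdio.h>
#include <string.h>

#define REC_SIZE   17
#define SLOT_COUNT 100                      /* entries reserved per subset */
#define SLOT_BYTES (SLOT_COUNT * REC_SIZE)  /* 1700 bytes per subset */

int lookup(FILE *f, const unsigned char key[16], unsigned char *result)
{
    unsigned subset = ((unsigned)key[0] << 8) | key[1];   /* 16^4 subsets */
    unsigned char rec[REC_SIZE];

    fseek(f, (long)subset * SLOT_BYTES, SEEK_SET);
    for (int i = 0; i < SLOT_COUNT; i++) {
        fread(rec, 1, REC_SIZE, f);
        if (memcmp(key, rec, 16) == 0) {
            *result = rec[16];
            return 1;
        }
    }
    return 0;
}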

Regards

Mark

High Performance Mark
+1  A: 

Your key is 128 bits, but if you have max 10^7 entries, it only takes 24 bits to index it.

  1. You could make a hash table, or

  2. Use a Bentley-style unrolled binary search (at most 24 comparisons), as shown below.

Here's the unrolled loop (with 32-bit ints).

int key[4];        /* the 16-byte key to look up, as four 32-bit words */
int a[1<<24][4];   /* table of keys, sorted by the same 4-word comparison */
int i;

/* lexicographic "key k >= row j of the table" */
#define GE(k, j) ((k)[0] != a[(j)][0] ? (k)[0] > a[(j)][0] : \
                  (k)[1] != a[(j)][1] ? (k)[1] > a[(j)][1] : \
                  (k)[2] != a[(j)][2] ? (k)[2] > a[(j)][2] : \
                                        (k)[3] >= a[(j)][3])

i = 0;
if (GE(key, i + (1<<23))) i += 1<<23;
if (GE(key, i + (1<<22))) i += 1<<22;
if (GE(key, i + (1<<21))) i += 1<<21;
...
if (GE(key, i + (1<<2))) i += 1<<2;
if (GE(key, i + (1<<1))) i += 1<<1;
if (GE(key, i + (1<<0))) i += 1<<0;
/* a[i] is now the candidate slot; compare a[i] against key for a hit */
Mike Dunlavey