I'm storing millions, possibly billions, of 4-byte values in a hashtable, and I don't want to store any of the keys. I expect that only the hashes of the keys and the values will have to be stored. This has to be fast and kept entirely in RAM. Entries would still be looked up by key, unlike with a set().

What is an implementation of this for Python? Is there a name for this?

Yes, collisions are allowed and can be ignored.

(I can make an exception for collisions: the key can be stored for those. Alternatively, collisions can just overwrite the previously stored value.)

+1  A: 

How about using an ordinary dictionary and instead of doing:

d[x]=y

use:

d[hash(x)]=y

To look up:

d[hash(foo)]

Of course, if there is a hash collision, you may get the wrong value back.
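Put together, a minimal sketch of this approach (collisions simply overwrite, which the question allows):

d = {}

def store(key, value):
    # Only the key's hash is kept; a colliding key overwrites the old value.
    d[hash(key)] = value

def lookup(key):
    # May return a value that was stored under a different, colliding key.
    return d.get(hash(key))

store("some key", 1234)
print(lookup("some key"))  # 1234

Note that CPython still stores the integer hash as a full dictionary key, so this saves the key objects themselves but not the per-entry dictionary overhead.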

Mark Byers
Nice. This gets me closer. It's still more than 4 bytes of extra overhead per entry, though, as opposed to not storing them at all. That'll add up with a billion entries. BTW, each value will only be 4 bytes.
hashlib.sha512, for example. You can calculate the likelihood of collisions - for billions of entries it is very, very small - and weigh up the cost of a collision. It may not even be worth detecting collisions, depending on the application.
gnibbler
Yes, for this application it is acceptable to just ignore collisions up to a reasonable threshold, something like 1% to 10%.
Array is... OK. This is exactly how I implemented it already, using numpy. But it's missing a lot of the functionality of a Python dictionary, such as auto-resizing, and it's thousands of times slower.
How big do you expect a hash to be? I think they're usually 32-bit unsigned integers. That's how big pointers are, so good luck getting much smaller. With perfect hashing, a 32-bit uint is going to give you no collisions for 4 billion values. Any fewer bits and you are guaranteed to have LOTS of collisions. Here's a fast custom hashing algorithm that at least looks promising, for you to implement in C: http://burtleburtle.net/bob/hash/doobs.html
Merlyn Morgan-Graham
Also, there is no such thing as a perfect hashing algorithm, so you're going to have tons of collisions for "billions" of values unless you have > 4 byte keys/hashes.
Merlyn Morgan-Graham
If the table is going to be a huge overhead, can you just append the 4-byte value to your key data structure, or write a structure that contains both?
Merlyn Morgan-Graham
A: 

Why not a dictionary + hashlib?

>>> import hashlib
>>> hashtable = {}
>>> def myHash(obj):
...     return hashlib.sha224(obj).hexdigest()
...

>>> hashtable[myHash("foo")] = 'bar'
>>> hashtable
{'0808f64e60d58979fcb676c96ec938270dea42445aeefcd3a4e6f8db': 'bar'}
jbochi
That's over 28 bytes of overhead for each 4-byte entry... Won't work.
You can set the digest size for your needs, can't you?
jbochi
Yes, but anything bigger than zero is more than I want, and anything smaller than the hashtable's internal hash size will cause more collisions than necessary. Either way, storing the key is a big waste of space here.
+1  A: 

A hash table has to store keys, unless you provide a hash function that gives absolutely no collisions, which is nearly impossible.

If your keys are string-like, however, there is a very space-efficient data structure: the directed acyclic word graph (DAWG). I don't know of any Python implementation, though.

el.pescado
I've stated that collisions are perfectly fine for this application. Nice idea with the DAWG; unfortunately, my keys are very random.
+2  A: 

Build your own b-tree in RAM.

Memory use:

(4 bytes) comparison hash value

(4 bytes) index of next leaf if hash <= comparison OR if negative index of value

(4 bytes) index of next leaf if hash > comparison OR if negative index of value

12 bytes per b-tree node for the b-tree. More overhead for the values (see below).

How you structure this in Python: aren't there "native arrays" of 32-bit integers supported with almost no extra memory overhead? What are they called... anyway, those (sketched below).

Keep a separate, ordered array of subarrays, each containing one or more values. The "indexes of value" above are indexes into this big array, allowing retrieval of all values matching the hash.

This assumes a 32bit hash. You will need more bytes per b-tree node if you have greater than 2^31-1 entries or a larger hash.
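A minimal sketch of the lookup side, using Python's array module for those packed 32-bit fields (the names are illustrative; value indexes are stored as -(index + 1) so that index 0 remains representable):

from array import array

cmp_hash = array('i')  # comparison hash value per node
lo = array('i')        # next node if hash <= comparison, or -(value index + 1)
hi = array('i')        # next node if hash >  comparison, or -(value index + 1)
values = array('i')    # the 4-byte values themselves

def lookup(h, root=0):
    # Walk down from the root until a negative entry marks a value slot.
    node = root
    while True:
        nxt = lo[node] if h <= cmp_hash[node] else hi[node]
        if nxt < 0:
            return values[-nxt - 1]
        node = nxt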

BUT - a spanner in the works, perhaps: note that if you are not storing the key values, you will not be able to verify that a looked-up hash value corresponds only to your key, unless through some algorithmic or organisational mechanism you have guaranteed that no two keys will have the same hash. Quite a serious issue here. Have you considered it? :)

martinr
+1 for the last paragraph, which was the first response I had to this question. Say you're not storing the actual keys in the table. Now say you have a key-value pair K1-V1 to insert into the hash table, where hash(K1) = H1. You look up H1 in the hash table and find it exists. The problem is that you have no idea whether H1 came from hash(K1) or from hash(K5000), where hash(K5000) also happens to = H1. So you don't even know whether K1-V1 is a collision or not.
shoover
You may very well be on to something. I'm examining this now: pypi.python.org/pypi/blist
To compress the values store, you could give up two extra bits in your "negative index" of value above, to indicate whether the number of value matches for the hash is 1, 2, 3 or 4+. Then all three-value matches could be stored together (in runs of three in a 32-bit array, so zero overhead), all two-value matches could be stored together in runs of two with no extra overhead, and similarly for single values; for 4+, counts followed by the values could be shoehorned together in one 32-bit array. It depends how much memory you want to save and the exact numbers of objects involved.
martinr
+1  A: 

Although Python dictionaries are very efficient, I think that if you're going to store billions of items, you may want to create your own C extension with data structures optimized for the way you are actually using them (sequential access? completely random? etc.).

In order to create a C extension, you may want to use SWIG, or something like Pyrex (which I've never used).

Roberto Liffredo
+1  A: 

It's not what you asked for, but why not consider Tokyo Cabinet or BerkeleyDB for this job? It won't be in memory, but you are trading performance for greater storage capacity. You could still keep your list in memory and use the database only to check existence.

+2  A: 

It's the good old space vs. runtime tradeoff: you can have constant-time lookup with linear space usage for the keys in a hashtable, or you can store the keys implicitly and use O(log n) time with a binary tree. The (binary) hash of a value gives you the path in the tree where it will be stored.
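As an illustrative sketch of that idea (node layout and names are assumptions, not from the answer): a digital search tree that descends by successive bits of the hash, storing only the 4-byte hash and the value per node, with expected depth O(log n) for random hashes:

def insert(tree, h, value):
    # tree is a one-element list so the root can be replaced in place
    if tree[0] is None:
        tree[0] = [h, value, None, None]
        return
    node, depth = tree[0], 0
    while True:
        if node[0] == h:
            node[1] = value                 # hash collision: overwrite
            return
        b = 2 + ((h >> depth) & 1)          # pick child by the next hash bit
        if node[b] is None:
            node[b] = [h, value, None, None]
            return
        node, depth = node[b], depth + 1

def lookup(tree, h):
    node, depth = tree[0], 0
    while node is not None:
        if node[0] == h:
            return node[1]
        node, depth = node[2 + ((h >> depth) & 1)], depth + 1
    return None

tree = [None]
insert(tree, hash("key") & 0xFFFFFFFF, 1234)
print(lookup(tree, hash("key") & 0xFFFFFFFF))  # 1234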

THC4k
+1  A: 

If you're actually storing millions of unique values, why not use a dictionary as a bit set?
Store: d[hash(key)//32] = d.get(hash(key)//32, 0) | 2**(hash(key)%32)
Check: d.get(hash(key)//32, 0) & 2**(hash(key)%32)

If you have billions of entries, use a numpy array of size (2**32)/32, instead. (Because, after all, you only have 4 billion possible values to store, anyway).
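A rough sketch of that numpy variant (a flat bit array covering all 2**32 possible 32-bit hashes, about 512 MB; note this records membership only, not the 4-byte values):

import numpy as np

bits = np.zeros(2**32 // 32, dtype=np.uint32)  # one bit per possible hash

def add(key):
    h = hash(key) & 0xFFFFFFFF                 # clamp to 32 bits
    bits[h >> 5] |= np.uint32(1 << (h & 31))

def contains(key):
    h = hash(key) & 0xFFFFFFFF
    return bool(bits[h >> 5] & np.uint32(1 << (h & 31)))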

thouis
In fact, you could grow the numpy array implementation automatically, according to how many collisions you're willing to tolerate in the space/accuracy tradeoff. For fewer values, index by hash(key)//(2**K), with K starting at 20, for instance, and decreasing as the table becomes more dense with entries.
thouis
+3  A: 

Bloomier filters - space-efficient associative array

From Wikipedia:

Chazelle et al. (2004) designed a generalization of Bloom filters that could associate a value with each element that had been inserted, implementing an associative array. Like Bloom filters, these structures achieve a small space overhead by accepting a small probability of false positives. In the case of "Bloomier filters", a false positive is defined as returning a result when the key is not in the map. The map will never return the wrong value for a key that is in the map.
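I don't know of a ready-made Python implementation; below is a rough, illustrative sketch of the XOR-table idea behind such structures, built by hypergraph "peeling" (all names are made up, and this is a simplification rather than Chazelle et al.'s exact construction). Looking up a stored key XORs its k table slots together to recover its value; an absent key returns an arbitrary value, which is the false-positive behaviour described above:

import random

def _slots(key, m, k, seed):
    # k distinct pseudo-random slots per key (illustrative scheme; uses
    # Python's hash(), so it is only stable within a single process)
    return random.Random(hash((key, seed))).sample(range(m), k)

def build(pairs, m, k=3, max_tries=50):
    # m should be roughly 1.3x the number of keys or more for k=3,
    # otherwise peeling is unlikely to succeed
    for seed in range(max_tries):
        keyslots = {key: _slots(key, m, k, seed) for key, _ in pairs}
        owners = {}                       # slot -> keys hashing to it
        for key, ss in keyslots.items():
            for s in ss:
                owners.setdefault(s, set()).add(key)
        stack, peeled = [], set()
        singles = [s for s, ks in owners.items() if len(ks) == 1]
        while singles:                    # peel keys that own a slot alone
            s = singles.pop()
            if len(owners.get(s, ())) != 1:
                continue
            (key,) = owners[s]
            stack.append((key, s))
            peeled.add(key)
            for t in keyslots[key]:
                owners[t].discard(key)
                if len(owners[t]) == 1:
                    singles.append(t)
        if len(peeled) == len(keyslots):  # success: fill table in reverse
            table = [0] * m
            value_of = dict(pairs)
            for key, s in reversed(stack):
                acc = value_of[key]
                for t in keyslots[key]:
                    if t != s:
                        acc ^= table[t]
                table[s] = acc
            return table, seed
    raise RuntimeError("no peelable layout found; try a larger m")

def lookup(table, seed, key, k=3):
    acc = 0
    for s in _slots(key, len(table), k, seed):
        acc ^= table[s]
    return acc    # correct for stored keys; arbitrary for absent ones

table, seed = build([('a', 1), ('b', 2), ('c', 3)], m=8)
print(lookup(table, seed, 'b'))  # 2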

J.F. Sebastian
I can't find an implementation anywhere?
+1  A: 

Would you please tell us more about the keys? I'm wondering if there is any regularity in the keys that we could exploit.

If the keys are strings in a small alphabet (example: strings of digits, like phone numbers), you could use a trie data structure:

http://en.wikipedia.org/wiki/Trie
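For illustration, a minimal trie over strings built from nested dicts ('$' is an assumed sentinel slot holding the value for a complete key):

def trie_insert(root, key, value):
    node = root
    for ch in key:
        node = node.setdefault(ch, {})  # one dict level per character
    node['$'] = value

def trie_lookup(root, key):
    node = root
    for ch in key:
        node = node.get(ch)
        if node is None:
            return None
    return node.get('$')

root = {}
trie_insert(root, "5551234", 42)
print(trie_lookup(root, "5551234"))  # 42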

steveha