What's a correct and good way to implement __hash__()?

I am talking about the method that returns a hash code that is then used to insert objects into hash tables, a.k.a. dictionaries.

As __hash__() returns an integer that is used for "binning" objects into hash tables, I assume the returned values should be uniformly distributed for common data (to minimize collisions). What's a good practice to get such values? Are collisions a problem? In my case I have a small class which acts as a container holding some ints, some floats, and a string.

+6  A: 

An easy, correct way to implement __hash__() is to use a key tuple. It won't be as fast as a specialized hash, but if you need that then you should probably implement the type in C.

Here's an example of using a key for hash and equality:

class A(object):
    def __key(self):
        # The tuple of attributes that defines this object's identity.
        return (self.attr_a, self.attr_b, self.attr_c)

    def __eq__(self, other):
        return self.__key() == other.__key()

    def __hash__(self):
        # Hashing the key tuple combines the attributes' hashes for us.
        return hash(self.__key())
John Millikin
Hm, I didn't think of that. However, this might lead to huge tuples/keys when the number of attributes that make my object unique is high.
Yes; if your object is very large, then its key will be correspondingly large (and the hash expensive to calculate). If the attributes can be enumerated (e.g., columns in an ORM object), then you can simplify `__key()`; however, you'll still have to hash every attribute value. There's not really any way around this.
John Millikin
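
For example, a brief usage sketch of the key-tuple approach (the __init__ and the attribute values here are assumptions, added so instances can be constructed):

class A(object):
    def __init__(self, attr_a, attr_b, attr_c):
        # Hypothetical constructor; the original answer leaves it out.
        self.attr_a = attr_a
        self.attr_b = attr_b
        self.attr_c = attr_c

    def __key(self):
        return (self.attr_a, self.attr_b, self.attr_c)

    def __eq__(self, other):
        return self.__key() == other.__key()

    def __hash__(self):
        return hash(self.__key())

# Equal keys imply equal hashes, so an equal (but distinct) instance
# finds the same dictionary entry.
d = {A(1, 2.0, "x"): "value"}
print(d[A(1, 2.0, "x")])  # prints "value"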
A: 

Perhaps the answer provided by this Stack Overflow post would help?

http://stackoverflow.com/questions/793761/built-in-python-hash-function

Nick Klauer
And how should I then combine hash(a) and hash(b)? Returning hash(a) + hash(b) is certainly not a good idea. Furthermore, I want an efficient yet correct solution. I know that for theoretical correctness I could use a SHA-256 checksum, but this is definitely *not* appropriate.
Try (hash(a) * X + hash(b)). Using a smallish prime such as 101 for X will give decent results.
George V. Reilly
Why must X be a prime?
See http://www.partow.net/programming/hashfunctions/#HashAndPrimes
George V. Reilly
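
A minimal Python sketch of that suggestion (the function name is illustrative):

def combined_hash(a, b):
    # Multiplying one hash by a smallish prime before adding the other
    # keeps (a, b) and (b, a) from colliding, unlike hash(a) + hash(b).
    return hash(a) * 101 + hash(b)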
A: 

It depends on the size of the hash value you return. It's simple logic that if you need to return a 32-bit int based on the hash of four 32-bit ints, you're going to get collisions.

I would favour bit operations, like the following C pseudocode:

int a, b, c, d;
int hash = (a & 0xF000F000) | (b & 0x0F000F00) | (c & 0x00F000F0) | (d & 0x000F000F);

Such a system could work for floats too, if you simply took their bit pattern rather than the floating-point value they represent; that might even work better. For strings, I've got little to no idea. (A sketch of the bit-pattern trick follows below.)

DeadMG
I know that there will be collisions. But I have no clue how these are handled. And furthermore, my attribute values in combination are very sparsely distributed, so I was looking for a smart solution. And somehow I expected there to be a best practice somewhere.
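
A small Python sketch of the float idea above, using the standard struct module to reinterpret a double's bytes as an integer (the function name is illustrative):

import struct

def float_bits(f):
    # Pack the float as an IEEE-754 double, then unpack those same
    # 8 bytes as a signed 64-bit integer: the raw bit pattern.
    return struct.unpack("<q", struct.pack("<d", f))[0]

print(float_bits(1.5))  # 4609434218613702656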
+3  A: 

I can try to answer the second part of your question.

The collisions will probably result not from the hash code itself, but from mapping the hash code to an index in a collection. So for example your hash function could return random values from 1 to 10000, but if your hash table only has 32 entries you'll get collisions on insertion.
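
In Python terms, that mapping is typically just a modulo (the names here are illustrative):

index = hash(key) % table_size  # e.g. hash codes up to 10000, but only 32 slots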

In addition, I would think that collisions are resolved internally by the collection, and there are many methods for resolving them. The simplest (and worst) is linear probing: given an entry to insert at index i, add 1 to i until you find an empty spot, and insert there. Retrieval then works the same way. This results in inefficient retrievals for some entries, as you could have an entry that requires traversing the entire collection to find! (A toy sketch follows below.)
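
A toy sketch of that linear-probing scheme in Python (the class name and fixed table size are illustrative; it assumes the table never fills up and ignores deletions):

class ProbingTable(object):
    def __init__(self, size=32):
        self.slots = [None] * size

    def insert(self, key, value):
        i = hash(key) % len(self.slots)
        # Step forward one slot at a time until we find a free slot
        # or an existing entry with the same key.
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)
        self.slots[i] = (key, value)

    def lookup(self, key):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None:
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % len(self.slots)
        raise KeyError(key)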

Other collision resolution methods reduce retrieval time by moving entries around in the hash table when an item is inserted, spreading things out. This increases insertion time, but assumes you read more than you insert. There are also methods that branch different colliding entries out so that entries don't cluster in one particular spot.

Also, if you need to resize the collection you will need to rehash everything or use a dynamic hashing method.
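
Continuing the toy table from the sketch above, a resize amounts to rehashing every entry (again just a sketch):

def resize(table, new_size):
    # Save the old entries, allocate fresh slots, and re-insert each
    # entry so it lands at its index modulo the new size.
    entries = [e for e in table.slots if e is not None]
    table.slots = [None] * new_size
    for key, value in entries:
        table.insert(key, value)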

In short, depending on what you're using the hash code for, you may have to implement your own collision resolution method. If you're not storing the objects in a collection, you can probably get away with a hash function that just generates hash codes in a very large range. If you are, you can make sure your container is bigger than it needs to be (the bigger the better, of course), depending on your memory concerns.

Here are some links if you're interested in reading more:

Coalesced hashing on Wikipedia

Wikipedia also has a summary of various collision resolution methods.

Also, "File Organization And Processing" by Tharp covers a lot of collision resolution methods extensively. IMO it's a great reference for hashing algorithms.

+1  A: 

Paul Larson of Microsoft Research studied a wide variety of hash functions. He told me that

h = 0
for c in some_string:
    h = 101 * h + ord(c)  # multiply by a small prime, then add the next character

worked surprisingly well for a wide variety of strings. I've found that similar polynomial techniques work well for computing a hash of disparate subfields.

George V. Reilly
Apparently Java does it the same way, but using 31 instead of 101.
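
To tie that comment to the code above, a small wrapper sketch (the function name is illustrative; note that Java additionally wraps the result to a 32-bit int, which Python's unbounded integers do not):

def larson_hash(s, prime=101):
    h = 0
    for c in s:
        h = prime * h + ord(c)
    return h

print(larson_hash("hello"))      # the 101-based variant above
print(larson_hash("hello", 31))  # the 31-based, Java-style variant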