I am not looking for links to information on hashing.

I am not looking for the world's greatest hash function.

I am interested in mini-stories describing

  • The problem domain you were working in
  • The nature of the data you were working with
  • What your thought process was in designing a hash function for your data.
  • How happy you were with your result.
  • What you learned from the experience that might be of value to others.
+9  A: 

The first issue I consider is whether an established algorithm will suit my requirements.

Adam Liss
Some questions that may help determine suitability:

  • Are collisions acceptable?
  • Should the hash be reversible?
  • Does the hash need to be continuous?
Jason Hernandez
+1  A: 

I'll second what Adam said: don't reinvent the hashing wheel.

The only time I have ever needed a custom hashing function was to compare digraphs for equality; the hashing function let me tell very efficiently when two graphs were not equal (when the hash values matched, I still had to do a node-by-node comparison to be certain).
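The answer doesn't show code, but the idea can be sketched in Python (the graph representation and function name here are invented for illustration): combine per-edge hashes order-independently, so that equal graphs always hash equal and differing hashes prove inequality, while matching hashes still require the full node-by-node comparison.

```python
# Sketch only, not the answer's actual code. A digraph is represented
# as a dict mapping each node to the set of its successor nodes.
def digraph_hash(graph):
    # Hash each edge, then combine with frozenset so that node
    # iteration order cannot affect the result.
    edge_hashes = frozenset(hash((u, v)) for u, vs in graph.items() for v in vs)
    return hash((len(graph), edge_hashes))

g1 = {"a": {"b"}, "b": {"c"}, "c": set()}
g2 = {"c": set(), "b": {"c"}, "a": {"b"}}   # same graph, listed in a different order
g3 = {"a": {"c"}, "b": {"c"}, "c": set()}   # one edge differs

assert digraph_hash(g1) == digraph_hash(g2)   # equal graphs: hashes must match
assert digraph_hash(g1) != digraph_hash(g3)   # not guaranteed, but overwhelmingly likely
```

A mismatch in hashes settles inequality in O(1) after the one-time hashing pass; only hash collisions fall through to the expensive structural comparison.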

Steven A. Lowe
+1  A: 

Doing data warehouse development. We had a dimension with about 9,000 rows, and the queries being developed against it included some really ugly ones.

So, I started analyzing the dimension rows. Dimension rows were clustered based on various combinations of columns. The clustering was a map from some key to a list of dimension rows that shared that key. The hash key, then, was a tuple of column values.

The intermediate result, in Python, looked like this

{ 
    ( (col1, col2), (col3, col4) ) : [ aRow, anotherRow, row3, ... ],
    ( (col1, col2), (col3, col4) ) : [ row1, row2, row3, row4, ... ],
}

Technically, this is an inverted index.

The hash key required some care in building a tuple of column values, partly because Python won't hash mutable collections (e.g., lists) as dictionary keys. More importantly, the tuples weren't simple flat lists of column values. They were usually two-tuples that attempted to partition the dimension rows into disjoint subsets based on key combinations.

The hash algorithm, down deep, is the built-in Python hash. Choosing the keys, however, wasn't obvious or easy.
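A minimal, runnable sketch of building such an inverted index (the column names, sample rows, and `key_for` function are invented for illustration):

```python
# Hypothetical sketch: group dimension rows by a tuple of column values.
from collections import defaultdict

rows = [
    {"col1": "A", "col2": 1, "col3": "X", "col4": 10},
    {"col1": "A", "col2": 1, "col3": "X", "col4": 10},
    {"col1": "B", "col2": 2, "col3": "Y", "col4": 20},
]

def key_for(row):
    # The key must be a tuple (immutable): lists are unhashable in Python.
    # A nested two-tuple mirrors the partitioning structure described above.
    return ((row["col1"], row["col2"]), (row["col3"], row["col4"]))

index = defaultdict(list)
for row in rows:
    index[key_for(row)].append(row)

assert len(index[(("A", 1), ("X", 10))]) == 2  # two rows share this key
assert len(index) == 2                         # two disjoint clusters
```

The hashing itself is just Python's built-in `hash` on the tuple; all the design effort goes into choosing which columns make up the key.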

S.Lott
+1  A: 

The first thing I think of is the best place to steal a hashing algorithm and its code. If and only if I find no appropriate algorithm do I use existing ones as a starting point to create my own. To be fair, I have been in this industry for 7 years, and I have never created my own hashing algorithm since college. But if I did create my own, minimizing collisions would be the biggest thing to think about. What are your likely values? Does the function disperse those values well enough that, ideally, there is a one-to-one relationship between the resulting value and the original value? Do the resulting values really disperse well, meaning that, say, they don't all share common factors? Common factors cause collisions when you take the value modulo the table size to fit it into your indexed collection.

Charles Graham
+1  A: 

Inventing a hashing algorithm is easy. Inventing a working, efficient, and effective hashing algorithm is not.

You should ask yourself:

  1. Do I need to have a hash at all?
  2. Assuming I do, can I implement a standard cookbook recipe (e.g., from Effective Java), including all related requirements (e.g., if a.equals(b) then a.hashCode() == b.hashCode())?

If you have two instances of an object that need to be compared for equality, then you probably need to provide an implementation for equals().

If you have multiple instances of an object that need to be sorted, then you probably need to provide an implementation for compareTo().

If you have a custom object being used as the key of a Map, then you probably need to provide an implementation of hashCode().
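The equals/hashCode contract the answer cites is Java's, but the same contract holds in Python's `__eq__`/`__hash__` pair, sketched here with an invented example class:

```python
# The same contract in Python: if a == b, then hash(a) == hash(b).
# The Point class and its fields are hypothetical, for illustration only.
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Derive the hash from exactly the fields that __eq__ compares,
        # so equal objects can never land in different buckets.
        return hash((self.x, self.y))

d = {Point(1, 2): "here"}
assert d[Point(1, 2)] == "here"   # a distinct-but-equal key finds the entry
```

Breaking the contract (e.g., defining `__eq__` but leaving the default identity-based `__hash__`) silently makes equal keys miss each other in dicts and sets.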

Michael Rutherfurd