I am not looking for links to information on hashing.

I am not looking for the world's greatest hash function.

I am interested in mini-stories describing

  • The problem domain you were working in
  • The nature of the data you were working with
  • What your thought process was in designing a hash function for your data.
  • How happy you were with your result.
  • What you learned from the experience that might be of value to others.
+9  A: 

The first issue I consider is whether an established algorithm will suit my requirements.

Adam Liss
Some questions that may help determine suitability:

  • Are collisions acceptable?
  • Should the hash be reversible?
  • Does the hash need to be continuous?
Jason Hernandez
+1  A: 

I'll second what Adam said: don't reinvent the hashing wheel.

The only time I have ever needed a custom hashing function was to compare digraphs for equality; the hashing function let me tell very efficiently when two graphs were not equal (when the hash values matched, I still had to do a node-by-node comparison to be certain).
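The answer doesn't show code, but the idea can be sketched in Python (the graph representation and function name here are invented for illustration): combine per-edge hashes order-independently, so that equal graphs always hash equal and differing hashes prove inequality, while matching hashes still require the full node-by-node comparison.

```python
# Sketch only, not the answer's actual code. A digraph is represented
# as a dict mapping each node to the set of its successor nodes.
def digraph_hash(graph):
    # Hash each edge, then combine with frozenset so that node
    # iteration order cannot affect the result.
    edge_hashes = frozenset(hash((u, v)) for u, vs in graph.items() for v in vs)
    return hash((len(graph), edge_hashes))

g1 = {"a": {"b"}, "b": {"c"}, "c": set()}
g2 = {"c": set(), "b": {"c"}, "a": {"b"}}   # same graph, listed in a different order
g3 = {"a": {"c"}, "b": {"c"}, "c": set()}   # one edge differs

assert digraph_hash(g1) == digraph_hash(g2)   # equal graphs: hashes must match
assert digraph_hash(g1) != digraph_hash(g3)   # not guaranteed, but overwhelmingly likely
```

A mismatch in hashes settles inequality in O(1) after the one-time hashing pass; only hash collisions fall through to the expensive structural comparison.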

Steven A. Lowe
+1  A: 

Doing data warehouse development. We had a dimension with about 9,000 rows, and the queries being developed against it included some really ugly ones.

So, I started analyzing the dimension rows. Dimension rows were clustered based on various combinations of columns. The clustering was a map from some key to a list of dimension rows that shared that key. The hash key, then, was a tuple of column values.

The intermediate result, in Python, looked like this

{ 
    ( (col1, col2), (col3, col4) ) : [ aRow, anotherRow, row3, ... ],
    ( (col1, col2), (col3, col4) ) : [ row1, row2, row3, row4, ... ],
}

Technically, this is an inverted index.

The hash key required some care in building a tuple of column values, partly because Python won't hash mutable collections (e.g., lists) as dictionary keys. More importantly, the tuples weren't simple flat lists of column values. They were usually two-tuples that attempted to partition the dimension rows into disjoint subsets based on key combinations.

The hash algorithm, down deep, is the built-in Python hash. Choosing the keys, however, wasn't obvious or easy.
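A minimal, runnable sketch of building such an inverted index (the column names, sample rows, and `key_for` function are invented for illustration):

```python
# Hypothetical sketch: group dimension rows by a tuple of column values.
from collections import defaultdict

rows = [
    {"col1": "A", "col2": 1, "col3": "X", "col4": 10},
    {"col1": "A", "col2": 1, "col3": "X", "col4": 10},
    {"col1": "B", "col2": 2, "col3": "Y", "col4": 20},
]

def key_for(row):
    # The key must be a tuple (immutable): lists are unhashable in Python.
    # A nested two-tuple mirrors the partitioning structure described above.
    return ((row["col1"], row["col2"]), (row["col3"], row["col4"]))

index = defaultdict(list)
for row in rows:
    index[key_for(row)].append(row)

assert len(index[(("A", 1), ("X", 10))]) == 2  # two rows share this key
assert len(index) == 2                         # two disjoint clusters
```

The hashing itself is just Python's built-in `hash` on the tuple; all the design effort goes into choosing which columns make up the key.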

S.Lott
+1  A: 

The first thing I think of is the best place to steal a hashing algorithm and its code. If and only if I find no appropriate algorithm do I use existing ones as a starting point to create my own. To be fair, I have been in this industry for 7 years, and I have never created my own hashing algorithm since college. But if I did create my own, minimizing collisions would be the biggest thing to think about. What are your likely values? Does the function disperse those values well enough that, ideally, there is a one-to-one relationship between the resulting value and the original value? Do the resulting values really disperse well, meaning that, say, they don't all share common factors? Common factors cause collisions when you take the value modulo the table size to fit it into your indexed collection.

Charles Graham
+1  A: 

Inventing a hashing algorithm is easy. Inventing a working, efficient, and effective hashing algorithm is not.

You should ask yourself:

  1. Do I need to have a hash at all?
  2. Assuming I do, can I implement a standard cookbook recipe (e.g., from Effective Java), including all related requirements (e.g., if a.equals(b) then a.hashCode() == b.hashCode())?

If you have two instances of an object that need to be compared for equality, then you probably need to provide an implementation for equals().

If you have multiple instances of an object that need to be sorted, then you probably need to provide an implementation for compareTo().

If you have a custom object being used as the key of a Map, then you probably need to provide an implementation of hashCode().
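The equals/hashCode contract the answer cites is Java's, but the same contract holds in Python's `__eq__`/`__hash__` pair, sketched here with an invented example class:

```python
# The same contract in Python: if a == b, then hash(a) == hash(b).
# The Point class and its fields are hypothetical, for illustration only.
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Derive the hash from exactly the fields that __eq__ compares,
        # so equal objects can never land in different buckets.
        return hash((self.x, self.y))

d = {Point(1, 2): "here"}
assert d[Point(1, 2)] == "here"   # a distinct-but-equal key finds the entry
```

Breaking the contract (e.g., defining `__eq__` but leaving the default identity-based `__hash__`) silently makes equal keys miss each other in dicts and sets.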

Michael Rutherfurd