Doing data warehouse development. We had a dimension with about 9,000 rows. The queries that were being developed included some really ugly queries.
So, I started analyzing the dimension rows. Dimension rows were clustered based on various combinations of columns. The clustering was a map from some key to a list of dimension rows that shared that key. The hash key, then, was a tuple of column values.
The intermediate result, in Python, looked like this
{
( (col1, col2), (col3, col4) ) : [ aRow, anotherRow, row3, ... ],
( (col1, col2), (col3, col4) ) : [ row1, row2, row3. row4, ... ],
}
Technically, this is an inverted index.
The hash key required some care in building a tuple of column values, partly because Python won't use mutable collections (i.e., lists). More important, the tuples weren't simple flat lists of column values. They were usually two-tuples that attempted to partition the dimension rows into disjoint subsets based on key combinations
The hash algorithm, down deep, is the built-in Python hash. Choosing the keys, however, wasn't obvious or easy.