What is a general collision-free Java best practice to generate hash codes for any-type (atomic types) multi-column primary keys?

I thought about it for a few hours and came to the conclusion that a string concatenated from all primary key columns would be the only reliable way to do so. Calling Java's hashCode method on that concatenated string should then yield a unique integer. (It would in fact somewhat mimic what a database index does, though I'm not sure here.)

For a multi-column primary key of the form

CREATE TABLE PlayerStats
(
    game_id INTEGER,
    is_home BOOLEAN,
    player_id SMALLINT,
    roster_id SMALLINT,
    ... -- (game_id, is_home) FK to score, (player_id, roster_id) FK to team member
    PRIMARY KEY (game_id, is_home, player_id, roster_id)
)

a hash code could be calculated like:

@Override
public int hashCode()
{
    //                                                               max chars:
    String surrogate =   String.format("%011d", this.gameId)         // 11
                       + String.format("%01d" , this.isHome ? 1 : 0) // 1
                       + String.format("%06d" , this.playerId)       // 6
                       + String.format("%06d" , this.rosterId);      // 6

    return surrogate.hashCode();
}

Of course, this only works with HashSet and Hashtable if equals is based on the same fields.

My question: is this a good general strategy?

I can see that on-the-fly calculation might not be the fastest. You might want to recalculate the hash code only when a composite key value changes (e.g. call a rehash() method from within every setter that operates on a key property).
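That caching idea could be sketched as follows (a hypothetical class, not from the question; note that mutating a key while it sits inside a HashSet or HashMap will break lookups regardless of caching):

```java
// Hypothetical mutable key that caches its hash code.
final class CachingKey
{
    private int gameId;
    private boolean isHome;
    private int cachedHash;

    CachingKey(int gameId, boolean isHome)
    {
        this.gameId = gameId;
        this.isHome = isHome;
        rehash();
    }

    void setGameId(int gameId)
    {
        this.gameId = gameId;
        rehash(); // recompute whenever a key column changes
    }

    private void rehash()
    {
        // Simple field combination; any consistent scheme works here.
        cachedHash = 31 * Integer.hashCode(gameId) + Boolean.hashCode(isHome);
    }

    @Override
    public int hashCode()
    {
        return cachedHash; // constant-time: just returns the precomputed value
    }
}
```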

Suggestions and improvements welcome. Aren't there any generally known strategies for this? A pattern?

A: 

The hash code is used as an index to look up elements in the data set that have the same code. The equals method is then used to find matches within the set of elements that have the same hash code. As such, the generated hash code doesn't have to be 100% unique. It just needs to be "unique enough" to create a decent distribution among the data elements so that there isn't a need to invoke the equals method on a large number of items with the same hashCode value.

From that perspective, generating lots and lots of strings and computing hash codes on those strings seems like an expensive way to avoid an equals operation that consists of 3 integer and 1 boolean comparison. It also doesn't necessarily guarantee uniqueness in the hash code value.

My recommendation would be to start with a simple approach of having the hash code of the key being the sum of the hash codes of its constituents. If that doesn't provide a good distribution because all of the ids are in a similar range, you could try multiplying the ids by some different factors before summing.
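A common Java idiom along these lines (my sketch, using hypothetical field names matching the table above) is to combine the constituent hash codes with a prime multiplier rather than a plain sum, which helps when the ids fall in similar ranges. java.util.Objects.hash does the same thing in one call:

```java
// Hypothetical key class for the PlayerStats table above.
final class PlayerStatsKey
{
    final int gameId;
    final boolean isHome;
    final short playerId;
    final short rosterId;

    PlayerStatsKey(int gameId, boolean isHome, short playerId, short rosterId)
    {
        this.gameId = gameId;
        this.isHome = isHome;
        this.playerId = playerId;
        this.rosterId = rosterId;
    }

    @Override
    public boolean equals(Object o)
    {
        // Three integer comparisons and one boolean comparison -- cheap.
        if (this == o) return true;
        if (!(o instanceof PlayerStatsKey)) return false;
        PlayerStatsKey other = (PlayerStatsKey) o;
        return gameId == other.gameId
            && isHome == other.isHome
            && playerId == other.playerId
            && rosterId == other.rosterId;
    }

    @Override
    public int hashCode()
    {
        // 31-multiplier idiom: each field shifts the previous result,
        // so different fields with similar values still spread out.
        int result = Integer.hashCode(gameId);
        result = 31 * result + Boolean.hashCode(isHome);
        result = 31 * result + Short.hashCode(playerId);
        result = 31 * result + Short.hashCode(rosterId);
        return result;
        // Equivalent one-liner: java.util.Objects.hash(gameId, isHome, playerId, rosterId)
    }
}
```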

btreat