views:

428

answers:

7

I'm trying to think up a good hash function for strings. And I was thinking it might be a good idea to sum up the unicode values for the first five characters in the string (assuming it has five, otherwise stop where it ends). Would that be a good idea, or is it a bad one?

I am doing this in Java, but I wouldn't imagine that would make much of a difference.

+5  A: 

Usually hashes wouldn't do sums, otherwise stop and pots will have the same hash.

and you wouldn't limit it to the first n characters because otherwise house and houses would have the same hash.

Generally hashs take values and multiply it by a prime number (makes it more likely to generate unique hashes) So you could do something like:

int hash=7;
for (int i=0; i < strlen; i++) {
    hash = hash*31+charAt(i);
}
jonathanasdf
+5  A: 

If you are doing this in Java then why are you doing it? Just call .hashCode() on the string

Pyrolistical
I'm doing it as part of the class, and part of the assignment is to write up several different hash functions. The professor told us to get outside help for the 'better' ones.
Leif Andersen
A: 
// djb2 hash function
unsigned long hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while (c = *str++)
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}

source Logic behind djb2 hash function - SO

TheMachineCharmer
+1  A: 

You should probably use String.hashCode().

If you really want to implement hashCode yourself:

Do not be tempted to exclude significant parts of an object from the hash code computation to improve performance -- Joshua Bloch, Effective Java

Using only the first five characters is a bad idea. Think about hierarchical names, such as URLs: they will all have the same hash code (because they all start with "http://", which means that they are stored under the same bucket in a hash map, exhibiting terrible performance.

Here's a war story paraphrased on the String hashCode from "Effective Java":

The String hash function implemented in all releases prior to 1.2 examined at most sixteen characters, evenly spaced throughout the string, starting with the first character. For large collections of hierarchical names, such as URLs, this hash function displayed terrible behavior.

Frederik
A: 

If it's a security thing, you could use Java crypto:

import java.security.MessageDigest;

MessageDigest messageDigest = MessageDigest.getInstance("SHA-256");
messageDigest.update(stringToEncrypt.getBytes());
String encryptedString = new String(messageDigest.digest());
Nick
A: 

If you want to see the industry standard implementations, I'd look at java.security.MessageDigest.

"Message digests are secure one-way hash functions that take arbitrary-sized data and output a fixed-length hash value."

Dean J
A: 

FNV-1 is rumoured to be a good hash function for strings.

For long strings (longer than, say, about 200 characters), you can get good performance out of the MD4 hash function. As a cryptographic function, it was broken about 15 years ago, but for non cryptographic purposes, it is still very good, and surprisingly fast. In the context of Java, you would have to convert the 16-bit char values into 32-bit words, e.g. by grouping such values into pairs. A fast implementation of MD4 in Java can be found in sphlib. Probably overkill in the context of a classroom assignment, but otherwise worth a try.

Thomas Pornin