I'm looking for a hash function that:

  1. Hashes textual strings well (e.g. few collisions)
  2. Is written in Java, and widely used
  3. Bonus: works on several fields (instead of me concatenating them and applying the hash on the concatenated string)
  4. Bonus: Has a 128-bit variant.
  5. Bonus: Not CPU intensive.
A: 

DISCLAIMER: This solution is applicable if you wish to efficiently hash individual natural language words. It is inefficient for hashing longer text, or text containing non-alphabetic characters.

I'm not aware of a function but here's an idea that might help:

  • Dedicate 52 of the 64 bits to representing which letters are present in the String. For example, if 'a' were present you'd set bit[0], for 'b' set bit[1], for 'A' set bit[26]. That way, only text containing exactly the same set of letters would have the same "signature".

You could then use the remaining 12 bits to encode the string length (or a modulo value of it) to further reduce collisions, or generate a 12 bit hashCode using a traditional hashing function.

Assuming your input is text-only, I can imagine this would result in very few collisions and would be inexpensive to compute (O(n)). Unlike the other solutions so far, this approach takes the problem domain into account to reduce collisions; it is based on the anagram detector described in Programming Pearls (see here). A minimal sketch follows.
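
Here's a minimal sketch of the idea, assuming input restricted to the letters a-z/A-Z (the method name is mine):

// Bits 0-25 mark lowercase letters, bits 26-51 uppercase letters,
// and the top 12 bits hold the string length modulo 4096.
public static long signatureHash(String s) {
  long h = 0L;
  for (int i = 0; i < s.length(); i++) {
    char c = s.charAt(i);
    if (c >= 'a' && c <= 'z') {
      h |= 1L << (c - 'a');
    } else if (c >= 'A' && c <= 'Z') {
      h |= 1L << (26 + (c - 'A'));
    }
  }
  h |= ((long) (s.length() % 4096)) << 52;
  return h;
}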

Adamski
-1 The longer the strings get, the more collisions you'll get. Additionally the hash is weak as (assuming natural language) most strings will contain vowels and frequent consonants (it's even possible to quite reliably guess the language of a string by frequent consonants and vowels by the way).
sfussenegger
@sfussenegger: The OP mentions the need to hash textual strings well, which implies some upper limit on string length (e.g. the longest word in the English language is only 45 characters). Additionally, the OP makes no mention that this needs to be a secure hash.
Adamski
Why does this requirement imply an upper limit on string length? This hash function is extremely weak - regardless of being secure or not.
sfussenegger
I read "textual strings" to imply natural language. If this isn't the case then I agree my approach is invalid. Why do you think it's weak?
Adamski
I suspect that this algo gives *lots of clashes* with short words: `rate == tear`, `rose == sore` etc etc.
oxbow_lakes
@oxbow_lakes: You're correct. In this case perhaps the remaining 12 bits should be used to store the result of String.hashCode() (modulo 4096).
Adamski
a) it doesn't work for special characters, numbers and the like b) see @oxbow_lakes' comment plus cases with duplicated characters (lose == loose) c) it won't work for long texts as they will have close to all character flags set ... I might even find more reasons, but that should be enough for now
sfussenegger
@Adamski: if String.hashCode() is the strongest part of your hash function, why not replace the whole function with String.hashCode()? If you change it from 32 to 64 bits - voila - you get exactly the approach I've suggested :) ... and I can't believe somebody voted for this suggestion at all.
sfussenegger
@sfussenegger: I agree with *all* your points. My solution assumes that the OP wants to hash individual natural language words.
Adamski
@sfussenegger: The approach you suggested is entirely valid. However, I was merely trying to suggest something that could help the OP if their *problem domain was sufficiently narrow*. If not then sure, they can call String.hashCode(), but given ripper's rep I imagine he already knows how to do this.
Adamski
+3  A: 

Create an SHA-1 hash and then mask out the lowest 64 bits.
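
A minimal sketch of that, assuming the string is encoded as UTF-8 (the method name is mine):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Digest the string with SHA-1 (160 bits), then keep 64 bits of it
// by packing the first 8 digest bytes into a long.
public static long sha1Hash64(String s) throws NoSuchAlgorithmException {
  byte[] digest = MessageDigest.getInstance("SHA-1")
      .digest(s.getBytes(StandardCharsets.UTF_8));
  long h = 0L;
  for (int i = 0; i < 8; i++) {
    h = (h << 8) | (digest[i] & 0xFFL);
  }
  return h;
}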

Aaron Digulla
You could even do an XOR of the first and last 64 bits. But isn't an SHA-1 hash a little over the top? If a cryptographically secure hash isn't necessary, you definitely lose some points on requirement 5 ;)
sfussenegger
@sfussenegger: Don't try to add random randomness. XOR doesn't always help. Even clipping the hash can have unpredictable results. Give it a try with a few million test cases or understand the mathematics behind it. Otherwise, you just make things worse with such a "blind improvement".
Aaron Digulla
It's not about adding random randomness. The idea was simply to keep all bits of the SHA-1 hash, which is **designed to be uniformly distributed**. Therefore, there shouldn't be any unexpected side effect - but on second thought it's useless overhead in the end. Clipping doesn't have unpredictable results either - because that's exactly what e.g. HashMap.indexFor(int,int) does to map a hash to an index of the hash table. So it really doesn't matter if you clip a 128-bit hash to 64 bits, as it will be further clipped to fit the hash table anyway.
sfussenegger
It really depends on the properties of the bits. Folding the hash this way can create unexpected clumps which *shouldn't* happen with clipping. But clipping can also fail if the various strings produce hashes with very similar lower bits. That's why it's usually better not to try to "improve" an existing algorithm without knowing its exact properties. I'm talking here because I once created a hash which didn't hash very well :)
Aaron Digulla
I think we all did it at least once ;) I'd be very interested to see if either approach causes problems with SHA-1. Maybe somebody finds some time to run a simple test?
sfussenegger
+9  A: 

Why don't you use a long variant of the default String.hashCode() (where some really smart guys certainly put effort into making it efficient - not to mention the thousands of developer eyes that already looked at this code)?

// adapted from String.hashCode()
public static long hash(String string) {
  long h = 1125899906842597L; // prime
  int len = string.length();

  for (int i = 0; i < len; i++) {
    h = 31*h + string.charAt(i);
  }
  return h;
}

If you're looking for even more bits, you could probably use a BigInteger.

Edit:

As I mentioned in a comment on @brianegge's answer, there are not many use cases for hashes with more than 32 bits and most likely not a single one for hashes with more than 64 bits:

I could imagine a huge hashtable distributed across dozens of servers, maybe storing tens of billions of mappings. For such a scenario, @brianegge still has a valid point here: 32 bits allow for 2^32 (ca. 4.3 billion) different hash keys. Assuming a strong algorithm, you should still have very few collisions. With 64 bits (18,446,744,073 billion different keys) you're certainly safe, regardless of whatever crazy scenario you need it for. Thinking of use cases for 128-bit keys (340,282,366,920,938,463,463,374,607,431 billion possible keys) is pretty much impossible though.
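As a back-of-the-envelope check using the standard birthday approximation: with n items and b-bit hashes, the expected number of collisions is roughly n(n-1)/2^(b+1). For 10 billion items against 64-bit keys that's about (10^10)^2 / 2^65 ≈ 2.7 collisions; against 32-bit keys, the pigeonhole principle alone forces billions of collisions at that scale.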

To combine the hashes for several fields, don't simply XOR them; multiply one with a prime and add them:

long hash = MyHash.hash(string1) * 31 + MyHash.hash(string2);

The small prime is in there to avoid equal hash codes for switched values, i.e. {'foo','bar'} and {'bar','foo'} aren't equal and should have different hash codes. XOR is bad as it returns 0 if both values are equal; therefore, {'foo','foo'} and {'bar','bar'} would have the same hash code.
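A small generalization of that line for any number of fields (the helper name is mine):

// Combine several 64-bit field hashes, order-sensitively, with the
// same multiply-by-a-prime-and-add scheme.
public static long combine(long... hashes) {
  long h = 1125899906842597L; // same start prime as hash() above
  for (long x : hashes) {
    h = 31 * h + x;
  }
  return h;
}

// e.g. long hash = combine(hash(string1), hash(string2), hash(string3));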

sfussenegger
Note that for Strings <= 5 chars, the upper 32 bits will be 0. So this will only make a difference for longer strings.
Aaron Digulla
Good catch. A higher prime and non-null start value should help.
sfussenegger
I've chosen a quite high 64-bit prime as a start value. As a result, hashes for Strings <= 5 chars shouldn't have all-zero upper 32 bits. However, on second thought I doubt that having 32 zeros at the beginning hurts the properties of the hash function. I kept 31 as the second prime as that's what's used in String.hashCode() too (which is still very close to what I suggest here).
sfussenegger
A: 

Did you look at Apache Commons Lang?

But for 64 bits (and 128) you need some tricks: the rules laid out in the book Effective Java by Joshua Bloch help you create a 64-bit hash easily (just use long instead of int). For 128 bits you need additional hacks...
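
A minimal sketch of that recipe with long in place of int (the example fields are mine):

// Effective Java's hashCode() recipe, widened to 64 bits: start from a
// nonzero constant and fold each field in with a prime multiplier.
public static long hash64(String name, int count, long product) {
  long result = 17L;
  result = 31L * result + (name == null ? 0L : name.hashCode());
  result = 31L * result + count;
  result = 31L * result + product; // a long field no longer needs (int) folding
  return result;
}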

St.Shadow
Commons-lang won't help at all for hashes larger than the standard 32-bit ones. It does those very well, but beyond that not so much.
jasonmp85
+1  A: 
long hash = string.hashCode();

Yes, the top 32 bits will be 0, but you will probably run out of hardware resources before you run into problems with hash collisions. The hashCode in String is quite efficient and well tested.

Update: I think the above satisfies the simplest thing which could possibly work; however, I agree with @sfussenegger's idea of extending the existing String hashCode.

In addition to having a good hashCode for your String, you may want to consider rehashing the hash code in your implementation. If your storage is used by other developers, or used with other types, this can help distribute your keys. For example, Java's HashMap is based on power-of-two length hash tables, so it adds this function to ensure the lower bits are sufficiently distributed.

    // Supplemental hash from java.util.HashMap: folds higher bits into
    // the lower bits that power-of-two table indexing actually uses.
    static int rehash(int h) {
      h ^= (h >>> 20) ^ (h >>> 12);
      return h ^ (h >>> 7) ^ (h >>> 4);
    }
brianegge
-1 Hardware resources are not the issue here - I didn't specify how the hash value is to be used, but I promise you it's not "storing n values in a hashmap". I will get collisions if I process enough items, well before I run into hardware problems.
ripper234
I could imagine a *huge* hashtable distributed across dozens of servers, maybe storing tens of billions of mappings. For such a scenario, @brianegge still has a valid point here: 32 bits allow for 2^32 (ca. 4.3 billion) different hash keys. Assuming a strong algorithm, you should still have very few collisions. With 64 bits (18,446,744,073 billion different keys) you're almost certainly safe, regardless of whatever crazy scenario you need it for. Thinking of use cases for 128-bit keys (340,282,366,920,938,463,463,374,607,431 billion) is quite impossible though.
sfussenegger
The main point here: A couple of people at Sun and all over the world have improved this algorithm over the past ten years. It's unlikely that you can come up with something better without investing at least a week or so doing extensive research on the distribution properties of your strings.
Aaron Digulla
A hash table with a 32-bit hash will start running into collisions after about 65,536 entries (roughly the square root of 2^32, per the birthday bound). A 32-bit hash won't suffice for a hash table with millions of entries. Even concatenated 32-bit hashes don't work; I tried it once and experienced a massive slowdown.
Seun Osewa
A: 

Why not use a CRC64 polynomial? These are reasonably efficient and optimized to make sure all bits are counted and spread over the result space.

There are plenty of implementations available on the net if you google "CRC64 Java".
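
The JDK itself only ships CRC32 (java.util.zip), so you'd use a library or roll your own. A table-driven sketch, assuming the reflected ECMA-182 polynomial (treat the constants as assumptions and verify them against a reference implementation):

public final class Crc64 {
  private static final long POLY = 0xC96C5795D7870F42L; // reflected ECMA-182

  // Byte-at-a-time lookup table, precomputed once.
  private static final long[] TABLE = new long[256];
  static {
    for (int i = 0; i < 256; i++) {
      long crc = i;
      for (int bit = 0; bit < 8; bit++) {
        crc = ((crc & 1) == 1) ? (crc >>> 1) ^ POLY : crc >>> 1;
      }
      TABLE[i] = crc;
    }
  }

  public static long hash(byte[] data) {
    long crc = ~0L; // common initial value; some variants start at 0
    for (byte b : data) {
      crc = TABLE[(int) ((crc ^ b) & 0xFF)] ^ (crc >>> 8);
    }
    return ~crc;
  }
}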

Peter Tillemans
A: 

Do something like this:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Test {

    public static void main(String[] args) throws NoSuchAlgorithmException,
            IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);

        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            SomeObject testObject = new SomeObject();

            dos.writeInt(testObject.count);
            dos.writeLong(testObject.product);
            dos.writeDouble(testObject.stdDev);
            dos.writeUTF(testObject.name);
            dos.writeChar(testObject.delimiter);
            dos.flush();

            byte[] hashBytes = md.digest(baos.toByteArray());
            BigInteger testObjectHash = new BigInteger(hashBytes);

            System.out.println("Hash " + testObjectHash);
        } finally {
            dos.close();
        }
    }

    private static class SomeObject {
        private int count = 200;
        private long product = 1235134123L;
        private double stdDev = 12343521.456d;
        private String name = "Test Name";
        private char delimiter = '\n';
    }
}

DataOutputStream lets you write primitives and Strings and have them output as bytes. Wrapping a ByteArrayOutputStream in it will let you write to a byte array, which integrates nicely with MessageDigest. You can pick from any algorithm listed here.

Finally BigInteger will let you turn the output bytes into an easier-to-use number. MD5 produces a 128-bit hash and SHA-1 a 160-bit one, so if you need 64 bits you can just truncate.

SHA-1 should hash almost anything well, and with infrequent collisions (it's 160-bit). This works from Java, but I'm not sure how it's implemented. It may actually be fairly fast. It works on several fields in my implementation: just push them all onto the DataOutputStream and you're good to go. You could even do it with reflection and annotations (maybe @HashComponent(order=1) to show which fields go into a hash and in what order; a sketch follows below). MD5 gives you the 128-bit variant, and I think you'll find it doesn't use as much CPU as you think it will.
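
Here's a hypothetical sketch of that annotation idea - both @HashComponent and ReflectiveHasher are made up for illustration, not an existing API:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical marker: fields carrying this annotation take part in
// the hash, in ascending 'order'.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface HashComponent {
  int order();
}

class ReflectiveHasher {
  static long hashOf(Object o) throws Exception {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    DataOutputStream dos = new DataOutputStream(baos);
    Field[] fields = Arrays.stream(o.getClass().getDeclaredFields())
        .filter(f -> f.isAnnotationPresent(HashComponent.class))
        .sorted(Comparator.comparingInt(
            (Field f) -> f.getAnnotation(HashComponent.class).order()))
        .toArray(Field[]::new);
    for (Field f : fields) {
      f.setAccessible(true);
      dos.writeUTF(String.valueOf(f.get(o))); // string form of each field
    }
    dos.flush();
    byte[] digest = MessageDigest.getInstance("SHA-1").digest(baos.toByteArray());
    long h = 0L;
    for (int i = 0; i < 8; i++) { // truncate the 160-bit digest to 64 bits
      h = (h << 8) | (digest[i] & 0xFFL);
    }
    return h;
  }
}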

I've used code like this to get hashes for huge data sets (by now probably billions of objects) to be able to shard them across many backend stores. It should work for whatever you need it for. Note that I think you may want to only call MessageDigest.getInstance() once and then clone() from then on: IIRC the cloning is a lot faster.
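
A sketch of that clone() pattern, reusing the imports from the example above (not every provider supports cloning, hence the checked exception):

// Create the digest once, then clone a fresh instance per use instead
// of paying for MessageDigest.getInstance() every time.
private static final MessageDigest PROTOTYPE;
static {
  try {
    PROTOTYPE = MessageDigest.getInstance("MD5");
  } catch (NoSuchAlgorithmException e) {
    throw new ExceptionInInitializerError(e);
  }
}

static MessageDigest newDigest() throws CloneNotSupportedException {
  return (MessageDigest) PROTOTYPE.clone();
}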

jasonmp85