views:

152

answers:

5

What are some simple ways to hash a 32-bit integer (e.g. IP address, e.g. Unix time_t, etc.) down to a 16-bit integer?

E.g. hash_32b_to_16b(0x12345678) might return 0xABCD.

Let's start with this as a horrible but functional example solution:

function hash_32b_to_16b(val32b) {
    return val32b % 0xffff;
}

Question is specifically about JavaScript, but feel free to add any language-neutral solutions, preferably without using library functions.

The context for this question is generating unique IDs (e.g. a 64-bit ID might be composed of several 16-bit hashes of various 32-bit values). Avoiding collisions is important.

Simple = good. Wacky+obfuscated = amusing.

+3  A: 

I think this is the best you're going to get. You could compress the code to a single line but the var's are there for now as documentation:

function hash_32b_to_16b(val32b) {
    var rightBits = val32b & 0xffff; // Left-most 16 bits
    var leftBits = val32b & 0xffff0000; // Right-most 16 bits

    leftBits = leftBits >>> 16; // Shift the left-most 16 bits to a 16-bit value

    return rightBits ^ leftBits; // XOR the left-most and right-most bits
}

Given the parameters of the problem, the best solution would have each 16-bit hash correspond to exactly 2^16 32-bit numbers. It would also IMO hash sequential 32-bit numbers differently. Unless I'm missing something, I believe this solution does those two things.

I would argue that security cannot be a consideration in this problem, as the hashed value is just too few bits. I believe that the solution I gave provides even distribution of 32-bit numbers to 16-bit hashes

John Bledsoe
Why do you think this is the best? I think it can get an awful lot of collisions for useful and frequent numbers.
Rotsor
This isn't the best idea. The reason is that IP addresses are often assigned as contiguous subnets. This means that if the IP address A.B.C.D exists on a network then A.(B^1).C.D and A.B.C.(D^1) are slightly more likely to exist too and will get the same hash.Obviously any hash will have lots of collisions. But your scheme will have more collisions than you'd expect from hashing 32-bit integers picked uniformly. You'll get better results by churning up the bits a little more.
Rotsor
@Rostor Ha, you are correct sir. The million-dollar question in all of this is the distribution of data that is in view.
John Bledsoe
+1  A: 

Something simple like this....

function hash_32b_to_16b(val32b) {    
    var h = hmac(secretKey, sha512);
    var v = val32b;
    for(var i = 0; i < 4096; ++i)
        v = h(v);
    return v % 0xffff;
}
Justice
Why 4096 times?
dkamins
To slow it down. This is a common technique for hashing passwords, to make it orders of magnitude more difficult to create a rainbow table or brute force passwords.
Justice
+2  A: 

I would say just apply a standard hash like sha1 or md5 and then grab the last 16 bits of that.

dreeves
Might there be issues with short input streams (like 4 bytes) for sha1 or md5?
dkamins
sh1 and md5 are typically not available in JavaScript environments. Are there slightly less secure but greatly simplified versions expressible in a few lines of JS?
dkamins
+2  A: 

Assuming that you expect the least significant bits to 'vary' the most, I think you're probably going to get a good enough distribution by just using the lower 16-bits of the value as a hash.

If the numbers you're going to hash won't have that kind of distribution, then the additional step of xor-ing in the upper 16 bits might be helpful.

Of course this suggestion is if you're intending to use the hash merely for some sort of lookup/storage scheme and aren't looking for the crypto-related properties of non-guessability and non-reversability (which the xor-ing suggestions don't really buy you either).

Michael Burr
+1  A: 

This depends on the nature of the integers. If they can contain some bit-masks, or can differ by powers of two, then simple XORs will have high probability of collisions. You can try something like (i>>16) ^ ((i&0xffff) * p) with p being a prime number.

Security-hashes like MD5 are all good, but they are obviously an overkill here. Anything more complex than CRC16 is overkill.

Rotsor
This is an interesting point and apparently relevant to hashing IP addresses, yes?
dkamins
Yes. For time values i anywhere :))
Rotsor
@Rotsor: Will any fixed prime number suffice? Why does this work?
dkamins
There is no way to tell what will "suffice" unless you know exactly what input data you will have. The worst-case number of collisions will still be the same.Multiplication by a prime number just makes it harder to find a real-life situation which will produce collisions systematically. (how often is your delta-time a multiple of 1009?)Why primes are better at this [is a long discussion](http://stackoverflow.com/questions/1488977/why-multiply-by-a-prime-before-xoring-in-many-gethashcode-implementations)
Rotsor