views:

345

answers:

4

Hi folks, I'm looking for a simple hash algroithm that will give me one byte of output for a string input (the inputs will be RFC822 email addresses, if that helps).

I'd like it to be simple, fast, and to magnify input differences (so two similar addresses have differnt outputs). (Yes, I am asking for a lot in one byte of output.)

Idealy, I'd like an XSL answer, but I can take it in either Java or Javascript (and then pass the hash as an argument to the XSL processor).

Thanks.

+1  A: 

Use a CRC-8, which has 9 bits of information, then drop a bit off either end and call it a day. Otherwise use any of the other common CRC algorithms.

whatsisname
+1  A: 

Why not take the most/least significant byte of the standard String hashCode() function ?

Brian Agnew
That should work as a java solution, but I'm holding out for an XSLT one.
Moose Morals
+1  A: 

Every hash function has its strengths and weaknesses, and fast and easy to compute ones tend to behave badly for certain classes of data. Trial and error needs to be a part of any solution. In addition to the other suggestions, you might try using integer multiplication as part of the hash function, for example

hash = 0
for (int i=0; i<data.length; i++)
    hash = ((37 * hash) + data[i]) & 0xff;
GregS
A: 

My suggestion would be to simply XOR all the bytes in the string. Every bit of every byte will influence the end result, and any single-bit error will definitely cause the hash to differ.

Very simple, very fast. And probably nearly as good as any other solution, given the small number of result bits.

Carl Smotricz
You'd probably want a bit more than that, as most email addresses are predominantly lowercase ascii with one '@' and one '.'; so you get only about 5 bits of variation rather than 8.
Pete Kirkham
I don't believe the wildly simplistic premise of the question justifies any more effort than this. Does it really matter if you detect different addresses in 31/32 cases (=96.9%) or 255/256 (=99.6)?
Carl Smotricz
A possible problem with a simple XOR is that it is unaffected by order, so that "abc" has the same hash as "cba". But the XOR could ultimately work as well or better as something more painful, because as you point it is basically impossible to do very well with an 8 bit output. The XOR method will detect an odd number of bit errors in *any* of the bit positions, a definite plus.
GregS
At the risk of overengineering this, I would be tempted to compute the lower order 5 bits of the hash as you suggested, and the high-order 3 bits using something like my earlier method but with hash3 = ((5 * hash3) + data[i]) mod 7 instead. This should also detect about 6/7 of the misordered addresses as well.
GregS