views:

1151

answers:

8

When dealing with a series of numbers, and wanting to use hash results for security reasons, what would be the best way to generate a hash value from a given series of digits? Examples of input would be credit card numbers, or bank account numbers. Preferred output would be a single unsigned integer to assist in matching purposes.

My feeling is that most of the string implementations appear to have low entropy when run against such a short range of characters and because of that, the collision rate might be higher than when run against a larger sample.

The target language is Delphi, however answers from other languages are welcome if they can provide a mathmatical basis which can lead to an optimal solution.

The purpose of this routine will be to determine if a previously received card/account was previously processed or not. The input file could have multiple records against a database of multiple records so performance is a factor.

+2  A: 

If you need security, use a cryptographically secure hash, such as SHA-256.

Ants Aasma
Ideal would be a result as an integer. SHA-256 is a bit overkill for my use.
skamradt
For short input simple cryptographic hash functions are to simple because they can be easily attacked with brute force.
Daniel Brückner
A derivation function such as scrypt to make the computation harder would be adivsable. An integer sized output will result in collisions at about 65k entries. And that is without accounting for possible cryptanalysis attacks resulting from discarding parts of the output.
Ants Aasma
Credit card numbers are 17+ digits (The dates have limits, I wouldn't even give them two digits worth of data.) That's pretty hard to brute force.As for SHA-256 being overkill--use it and then XOR chunks together down to whatever size you decide to go with.
Loren Pechtel
@Daniel: The issue isn't the function, but rather the input. No algorithm is going to magically turn an enumerable range into one that isn't. All you can do is use the best primitive you have (secure hashing, in this case), and iterate it for extra security.
Nick Johnson
+1  A: 

By definition, a cryptographic hash will work perfectly for your use case. Even if the characters are close, the hash should be nicely distributed.

So I advise you to use any cryptographic hash (SHA-256 for example), with a salt.

tonfa
For short inputs simple cryptographic hash functions are to simple because they can be easily attacked with brute force.
Daniel Brückner
+5  A: 

This seems to be a case for key derivation functions. Have a look at PBKDF2.

Just using cryptographic hash functions (like the SHA family) will give you the desired distribution, but for very limited input spaces (like credit card numbers) they can be easily attacked using brute force because this hash algorithms are usually designed to be as fast as possible.

UPDATE

Okay, security is no concern for your task. Because you have already a numerical input, you could just use this (account) number modulo your hash table size. If you process it as string, you might indeed encounter a bad distribution, because the ten digits form only a small subset of all possible characters.

Another problem is probably that the numbers form big clusters of assigned (account) numbers with large regions of unassigned numbers between them. In this case I would suggest to try highly non-linear hash function to spread this clusters. And this brings us back to cryptographic hash functions. Maybe good old MD5. Just split the 128 bit hash in four groups of 32 bits, combine them using XOR, and interpret the result as a 32 bit integer.

While not directly related, you may also have a look at Benford's law - it provides some insight why numbers are usually not evenly distributed.

Daniel Brückner
Interesting, but looping through the routine a minimum of 1000 times (for PBKDF2) doesn't actually sound like optimal to me.
skamradt
If you need the hash for security critical operations, don't worry about executions times. Time is the enemy of the attacker. If you didn't mean security critical operations, PBKDF2 is probably really not the best choice.
Daniel Brückner
+1  A: 

For a non cryptographic approach you could take a look at the FNV hash it's fast with a low collision rate.

As a very fast alternative, I've also used this algorithm for a few years and had few collision issues however I can't give you a mathematical analysis of it's inherent soundness but for what it's worth here it is

=Edit - My code sample was incorrect - now fixed =

In c/c++

unsigned int Hash(const char *s)
{
    int hash = 0;

    while (*s != 0)
    {
     hash *= 37;
            hash += *s;
     s++;
    }

    return hash;
}

Note that '37' is a magic number, so chosen because it's prime

zebrabox
This one looks promising without any analysis. You might want to point out that 37 is an optional parameter (which must be prime).
Will Bickford
@Will Bickford. Thanks, I'll update my post
zebrabox
The code you posted does not match the code at the link you provided. The code at the link uses xor, not addition, and it multiplies the hash by the prime value, not just each of the bytes being hashed.
Rob Kennedy
@Rob Kennedy. The code sample is not related to the link - they are completely different. I offered the code sample as fast alternative
zebrabox
+6  A: 

With security questions all the answers lay on a continuum from most secure to most convenient. I'll give you two answers, one that is very secure, and one that is very convenient. Given that and the explanation of each you can choose the best solution for your system.

You stated that your objective was to store this value in lieu of the actual credit card so you could later know if the same credit card number is used again. This means that it must contain only the credit card number and maybe a uniform salt. Inclusion of the CCV, expiration date, name, etc. would render it useless since it the value could be different with the same credit card number. So we will assume you pad all of your credit card numbers with the same salt value that will remain uniform for all entries.

The convenient solution is to use a FNV (As Zebrabox and Nick suggested). This will produce a 32 bit number that will index quickly for searches. The downside of course is that it only allows for at max 4 billion different numbers, and in practice will produce collisions much quicker then that. Because it has such a high collision rate a brute force attack will probably generate enough invalid results as to make it of little use.

The secure solution is to rely on SHA hash function (the larger the better), but with multiple iterations. I would suggest somewhere on the order of 10,000. Yes I know, 10,000 iterations is a lot and it will take a while, but when it comes to strength against a brute force attack speed is the enemy. If you want to be secure then you want it to be SLOW. SHA is designed to not have collisions for any size of input. If a collision is found then the hash is considered no longer viable. AFAIK the SHA-2 family is still viable.

Now if you want a solution that is secure and quick to search in the DB, then I would suggest using the secure solution (SHA-2 x 10K) and then storing the full hash in one column, and then take the first 32 bits and storing it in a different column, with the index on the second column. Perform your look-up on the 32 bit value first. If that produces no matches then you have no matches. If it does produce a match then you can compare the full SHA value and see if it is the same. That means you are performing the full binary comparison (hashes are actually binary, but only represented as strings for easy human reading and for transfer in text based protocols) on a much smaller set.

If you are really concerned about speed then you can reduce the number of iterations. Frankly it will still be fast even with 1000 iterations. You will want to make some realistic judgment calls on how big you expect the database to get and other factors (communication speed, hardware response, load, etc.) that may effect the duration. You may find that your optimizing the fastest point in the process, which will have little to no actual impact.

Also, I would recommend that you benchmark the look-up on the full hash vs. the 32 bit subset. Most modern database system are fairly fast and contain a number of optimizations and frequently optimize for us doing things the easy way. When we try to get smart we sometimes just slow it down. What is that quote about premature optimization . . . ?

Jim McKeeth
Good advice, except that CRC32 is not a good hash function for this use case at all. It's intended to detect transmission errors, and is good at that, but makes no promises about collision rate. A non-cryptographic hash like FNV32 would be a better choice.
Nick Johnson
@Nick: Good advice. I've updated the answer.
Jim McKeeth
I ended up using a SHA implementation which appears to be collision proof.
skamradt
In theory SHA is not collision proof. In practice it is though. We can just say the odds are way in your favor.
Jim McKeeth
Doesn't 32bits give you ~4 billion numbers, a wee bit larger than 65536? Or am I missing something obvious? Nice answer by the way.
Alister
@Alister, you are correct. I will update my answer with that.
Jim McKeeth
+4  A: 

If performance is a factor I suggest to take a look at a CodeCentral entry of Peter Below. It performs very well for large number of items.

By default it uses P.J. Weinberger ELF hashing fuction. But others are also provided.

Erwin
A: 

Best hash function for the natural numbers let

 f(n)=n

No conflicts ;)

ldog
(-1) because this answer contains only humor. Don't get me wrong, I'm a regular Slashdot reader -- and love the humorous comments over there! -- I just don't think that this kind of stuff belongs here. See [this discussion][1].[1]: http://meta.stackoverflow.com/questions/17782/why-do-stackers-consistently-vote-down-humorous-responses
paprika
+2  A: 

I needed to look deeply into hash functions a few months ago. Here are some things I found.

You want the hash to spread out hits evenly and randomly throughout your entire target space (usually 32 bits, but could be 16 or 64-bits.) You want every character of the input to have and equally large effect on the output.

ALL the simple hashes (like ELF or PJW) that simply loop through the string and xor in each byte with a shift or a mod will fail that criteria for a simple reason: The last characters added have the most effect.

But there are some really good algorithms available in Delphi and asm. Here are some references:

See 1997 Dr. Dobbs article at burtleburtle.net/bob/hash/doobs.html
code at burtleburtle.net/bob/c/lookup3.c

SuperFastHash Function c2004-2008 by Paul Hsieh (AKA HsiehHash)
www.azillionmonkeys.com/qed/hash.html

You will find Delphi (with optional asm) source code at this reference:
http://landman-code.blogspot.com/2008/06/superfasthash-from-paul-hsieh.html
13 July 2008
"More than a year ago Juhani Suhonen asked for a fast hash to use for his hashtable. I suggested the old but nicely performing elf-hash, but also noted a much better hash function I recently found. It was called SuperFastHash (SFH) and was created by Paul Hsieh to overcome his 'problems' with the hash functions from Bob Jenkins. Juhani asked if somebody could write the SFH function in basm. A few people worked on a basm implementation and posted it."

The Hashing Saga Continues:
2007-03-13 Andrew: When Bad Hashing Means Good Caching
www.team5150.com/~andrew/blog/2007/03/hash_algorithm_attacks.html
2007-03-29 Andrew: Breaking SuperFastHash
floodyberry.wordpress.com/2007/03/29/breaking-superfasthash/
2008-03-03 Austin Appleby: MurmurHash 2.0
murmurhash.googlepages.com/
SuperFastHash - 985.335173 mb/sec
lookup3 - 988.080652 mb/sec
MurmurHash 2.0 - 2056.885653 mb/sec
Supplies c++ code MurmurrHash2.cpp and aligned-read-only implementation -
MurmurHashAligned2.cpp
//========================================================================
// Here is Landman's MurmurHash2 in C#
//2009-02-25 Davy Landman does C# implimentations of SuperFashHash and MurmurHash2
//landman-code.blogspot.com/search?updated-min=2009-01-01T00%3A00%3A00%2B01%3A00&updated-max=2010-01-01T00%3A00%3A00%2B01%3A00&max-results=2
//
//Landman impliments both SuperFastHash and MurmurHash2 4 ways in C#:
//1: Managed Code 2: Inline Bit Converter 3: Int Hack 4: Unsafe Pointers
//SuperFastHash 1: 281 2: 780 3: 1204 4: 1308 MB/s
//MurmurHash2 1: 486 2: 759 3: 1430 4: 2196

Sorry if the above turns out to look like a mess. I had to just cut&paste it.

At least one of the references above gives you the option of getting out a 64-bit hash, which would certainly have no collisions in the space of credit card numbers, and could be easily stored in a bigint field in MySQL.

You do not need a cryptographic hash. They are much more CPU intensive. And the purpose of "cryptographic" is to stop hacking, not to avoid collisions.

Guy Gordon