views:

524

answers:

10

I run a website where we mark certain accounts as scammers, and "flag" their account and all credit cards used as being bad. We don't store actual credit card values, but are storing a checksum/MD5 algorithm of it instead.

We are hitting collisions all the time now. What is the best way to store these values - non reversible, but able to do comparisons on future values.

I thought MD5 would be the best, but we've got a debate going on here...

+3  A: 

Use SHA1, hash collisions are yet to be found.

Niteriter
"Hash collisions are yet to be found"? That's very unlikely. What you meant to say is probably "no collision attack" is yet known. Collisions can always occur when you reduce arbitrary large amounts of data to a fixed-size value.
Joachim Sauer
You're right, this is what I meant. Sorry, I'm not native speaker and sometimes find it difficult to express myself.
Niteriter
SHA-1 is broken: http://www.schneier.com/blog/archives/2005/02/sha1_broken.html, so use SHA-256
Jacco
+2  A: 

If you are finding collisions with MD5, why not use a better algorithm such as SHA1 or SHA256?

matt b
+12  A: 

A cryptographically secure hash would work. (SHA512 or SHA256 would be OK)

However, I would use a fairly secret salt that is not stored along with the cards (to prevent any sort of rainbow table attack).

PS:
Rainbow table attacks against credit cards could be particularlly effective, since the total size of the plain-text-space is quite small due to the limited character set, the fixed size, and the check digits.

PPS:
You can't use a random salt for each entry, because you would never be able to feasibly check duplicates. Salts are used to prevent collisions, whereas we are specifically looking for a collision in this instance.

John Gietzen
There are heated discussions about that on SO, but apparently you can safely store the salt along with the salted information, as long as salt is varied (would need a rainbow table per salt).
PhiLho
Excuse my ignorance, but why not just use a random salt, as is done here if "null" is passed? http://www.obviex.com/samples/hash.aspxI'm missing the point of the salt, if it's not needed to decrypt (such as in the link posted). Is this not a good example class to use?
TheSoftwareJedi
We aren't trying to store and recheck, we are trying to do duplicate matching. At that point, in order to check a duplicate, you would have to hash the CCard Number with EVERY salt in the database.
John Gietzen
+1  A: 

MD5 is NOT the way to go since it's broken. Quote Bruce Schneier: "[w]e already knew that MD5 is a broken hash function" and that "no one should be using MD5 anymore."

I.e. use SHA512 or SHA256 as someone already proposed.

Björn
SHA1 is also broken: http://www.schneier.com/blog/archives/2005/02/sha1_broken.html
Jacco
+1  A: 

Using the strongest hash possible is usually good. Speed is not of the essence and slowness actually works against anyone trying a brute force reversal of your hashed values.

I like whirlpool, personally - if you're using PHP check out the supported algorithms at the hash function docs

Whirlpool returns a string 128 characters long, but you don't have to store all of it necessarily. The first 32 or 64 chars would suffice. You could also consider sha512 or sha284.

Mark Snidovich
+3  A: 

It isn't sufficiently safe to just use a good Hash algorithm. If your list is stolen, your stored hashes can be used to retrieve working card information. The actual schema-space for credit card numbers is small enough that a determined attacker can pre-calculate many of the possible hashes ahead of time as well, and this may have other implications for your system if there is an intrusion or an inside-job.

I recommend you use a salt and also calculate a 2nd value to be added to the salt based on a formula involving each digit of the card number and the first salt value. This assures that if you lose control of either part, you still have reasonable uniqueness that renders ownership of the list useless. The formula should not be heavily weighted toward the first 6 digits of the card (BIN number), though, and no trace of the formula should be stored in the same location as either the salt or the final hash.

Consider the anatomy of a 16-digit credit card number:

6 digit BIN (Bank Identification Number)
9 digit Account Number
1 digit Luhn Checksum

BIN lists are well known within the processing industry and are not too difficult to assemble for those with access to an illicit list of card numbers. The number of valid BINs is further diminished by the assigned space for each issuer.

Visa - Starts with 4
American Express - Starts with 34 / 37
MasterCard - Starts with 5
Discover/CUP - Starts with 6
Diner's Club - Starts with 35
etc.

Note that some of the assigned BIN information within each issuer category is also sparse. If an attacker is aware of where most of your customers are located, then that will cut down the uniqueness considerably, as BIN information is assigned on a per-bank basis. An attacker that already has an account issued by a small bank in a wealthy neighborhood could just get an account and use the BIN as a starting point on his own card.

The checksum digit is calculated with a well-known formula, so that is immediately discardable as a source of unique data.

Armed with a handful of BINs worth targeting, an attacker has to check 9 digits at a time for each BIN set. This is 1 Billion Checksums and Hash Operations per set. I don't have any benchmarks handy, but I'm pretty sure 1 Million Hash operations per minute is not unreasonable for MD5 or any flavor of SHA on a suitably powerful machine. This amounts to less than a day to crack all matches under a given BIN.

Finally, you might consider storing a timestamp or visitor token (IP/subnet) with your hashes as well. It is nice to catch duplicate card numbers, but also consider the ramifications of someone stuffing your system with bogus card numbers. At some point you need to decide on a trade-off between blocking card numbers that you know are invalid, and also give yourself a mechanism to identify and repair misuse.

For example, a disgruntled employee could be stealing card information on his own and then use your hash mechanism against you by inserting valid hashes into your card number blacklist to block repeat business. It is quite expensive to undo this if you are just storing a hash- everything is opaque once it has been converted to a hash. With this in mind, give yourself a method to identify the source of the hash as well.

meklarian
Perhaps I'm naive but without the information of the name on the card or the expiration date, what can someone do with just the card number?
dlamblin
the cardholder name isn't used at the time of sale by processors to decide whether to reject or approve a transaction. it may be used afterward by any number of parties in the processing chain or by the software that relayed it. hence it is not usable as a barrier or as protection at the time of an attempted transaction. the expiration date does matter, but some ISOs/processors provide live lookups for test transactions. this makes it possible to brute force the ~120 forward months for a given card's expiration date if you have access to such a test account.
meklarian
+2  A: 

Perhaps you can store two different hashes of the card number. The chances that both hashes will result in collisions is practically zero.

Loadmaster
+1  A: 

Dont bother doing salts, just use HMACs. I know it's kind of an abuse, but then you get a decent keyed hash, so you can prevent collisions and rainbow table attacks.

The nice thing here is that even if the key leaks, nobody can decrypt it. The best thing that works for HMACs is brute force. Actually, the key here is a salt as mentioned earlier. The nice thing here is that the algorithm is a little better than the usual salting stuff done by most non-security programmers.

Henri
+1  A: 

As Henri already mentioned above (+1), the right solution is to use Message Authentication Code such as HMAC with a secret key. This is exactly the "secret salt" someone mentioned before. (BTW. Salts are always public).

Use standard construction such as HMAC-SHA-256 (RFC2104, FIPS-198a), keep the key secret and store the results (authentication tags) in a database.

The larger digest size (256 bits) of SHA-256 should prevent any collisions from happening, SHA-256 is a fairly good hash function and probability of random collisions is 2^-128, so if you ever encounter a collision in your system, please, let me know! :)

Krystian
A: 

People pointing out that a hash is "broken" are missing the point, perhaps regurgitating something they've heard without understanding what it means. When people talk about hashes being 'broken' they typically mean that it is possible to easily generate an alternate payload that has computes to the same hash.

This 'breaks' the hash but only for the specific purpose of using a hash to verify data is what it's supposed to be.

That isn't the important here, ie someone managing to create an alternate datastream that happens to hash down to the same value as one of the credit cards doesn't achieve anything meaningful or useful in terms of an attack vector.

The risk with hashes here is that the problem space for credit card numbers is pretty low and rainbow tables for them would be pretty cheap and easy to generate.

Adding a salt would add a bit of protection against already generated rainbow tables for pure card numbers but the extent to which it offers any real protection depends on how 'secret' the salt would remain in the case you are compromised. If the salt is exposed then new rainbow tables can then be cheaply generated and it's all over.

Given that the salt needs to be available to the application for it to perform checks against the blacklist there's a good chance someone compromising the blacklist data will also be able to get to the salt. If you have multiple servers you can mitigate that to some degree by ensuring both the salt and the data aren't in the same 'place' so an exposure of one server won't give someone all of the parts they need. (Similarly for backups don't store the data and the salt on the same media where someone can walk away with one tape and get everything). The salt only adds some protection while it is secret (in this type use).

If you have the resources to do it securely then I think that is the route to go. If you are getting a significant number of collisions on any reasonable hash function you must be doing a lot of volume. (In fact I'm highly surprised collisions would be a problem even then, any reasonable hash function should provide diverse results over a small problem space like this).

Paul