tags:

views:

68

answers:

3

Hi,

Currently I working on a project where some information have to be hashed. As the dataset is huge (millions of records created every day) the algorithm for data transformation has to be fast.

The pieces of data that have to be hashed are fixed length (11 decimal numbers - example: 05018144298). So what I would like to know is whether it is worth to create own hash function instead of using some of the existing (for example MD5) in order to significantly decrease the processing time and if so then what would be the best way to do it. Is it a good way to modify some of the existing algorithm (for example MD5 but break the input into chunks of smaller size and modify other parameters for fixed input of 11 decimal numbers) or is it better to design a hash function from scratch?

Thank you!

A: 

If your goal is to conceal personally identifiable information (such as phone numbers, social security numbers, etc.) then a hash is not a great solution. It will always be susceptible to attacks along the lines of a rainbow table, and (as others have pointed out quite clearly) you will get no security out of depending on some private hash function that you develop yourself.

Make a one-time pad (OTP). This is just a table keyed on the personally identifiable number, with a second column containing a random number in the same format. That second column is randomly generated (using the cryptographically secure RNG in Windows CSP or something similar) and guaranteed to be unique due to a unique index defined on it.

Use the OTP to replace all instances of the personally identifiable number with the corresponding random equivalent. Once you're done, throw the OTP away.

At this point, there are no stored secrets that could compromise the privacy of the data. In fact, the only way you could figure out what the random numbers correspond to is if you have all of the original data, and even that would be less than trivial.

Steven Sudit
+2  A: 

Don't try to create your own hash or encryption algorithm. If you aren't an expert in this field you will likely mess it up. Use an existing algorithm, developed by people that really knew what they were doing, implemented by people who knew what they're doing and that has been tried and tested.

That being said, it's not clear to me what you want to hash:

If it is a single number with 11 digits you can store the number in a 64bit integer (long long int in C). Would it be an option to just treat the number as already being a hash ?

If it's a 11-tupel, that is for example 11 32bit-numbers, then go with an existing algorithm like MD5, SHA-1 or whichever algorithm you like that is supported by your system, e.g. by OpenSSL. OpenSSL also has support for taking advantage of dedicated crypto-chips and extensions of your CPU (like all the MMX variants but even dedicated extensions for speeding up algorithms like AES that a few processors provide), so speed shouldn't be a problem.

DarkDust
+2  A: 
  1. It is not worth doing anything, performance-wise, until you have actually measured that using an existing hash function really has some non-negligible impact. A typical MD5 implementation, on a typical PC, will be able to process a few millions small messages per second, using a single core on the main CPU. Chances are that your "millions per day" are a piece of cake.

  2. Designing your own hash function, while keeping the security features of a hash function, is a very bad idea. Right now, the top cryptographers in the world are involved in the design of a new standard hash function, in an open competition organized by NIST. Dozens of very specialized researchers have worked on those for several years, and will keep on doing so for about two more years. A lone programmer, not very specialized in the subject, trying to do better within a few days or weeks, verges on the preposterous. Designing a secure hash function is hard.

The right thing to do, for you, is to use an existing, standard cryptographic hash function. That's not MD5, by the way; serious weaknesses have been uncovered in that function (actually, serious weaknesses have been uncovered around 1996, and MD5 has been unrecommended for the last 15 years). You'd better use SHA-256.

If you do not need the cryptographic properties of a hash function but just a kind of randomizing function for hashtable-like indexing, then any hash function will be good enough. Just profile it, notice that there is no performance issue whatsoever, and be happy.

Thomas Pornin
The advice about performance and not reinventing the wheel is spot on. As for a hash used just for good distribution, it turns out that any crypto hash will have excellent distribution, but many non-crypto hashes are iffy.
Steven Sudit
the known weaknesses for MD5 are collision attacks, not preimage attacks, so don't dismiss MD5 out of hand.
Jason S