I have a series of codes in the format:

AA12345A1

i.e.: [a-z]{2}[0-9]{5}[a-z][0-9]

and

AA12345A123

i.e.: [a-z]{2}[0-9]{5}[a-z][0-9]{3}

I need to create a new "code" of any format from either of the above two formats, to obscure the difference between the ones ending in 1 number and the ones ending in 3 numbers (this reveals information to the user that I need to hide).

The constraints for the new code format are:

  • they need to be human usable (so mixing upper and lower case letters is a bad idea usability-wise; they should also be as short as possible)
  • they must always be unique (no 9 or 11 char code should produce the same output)
  • it only needs to be a one way hash, I never need to get the original code back
  • the length of the original code (9 or 11 chars) must not be obvious - it doesn't need to be cryptographically strong, just opaque to the layman.

Are there any suitable hashing (or otherwise) algorithms to do this, or does anyone have any suggestion for a custom way of doing this?

Thanks

A: 

You have two tasks:

  • Create hash
  • Represent in a human usable form

So hash the original string with, e.g., SHA-1 to get a binary result. Then take 9 (or 11) groups of 5 bits from the binary hash and map each group through the following table:

00000 -> "0"
00001 -> "1"
...
01001 -> "9"
01010 -> "A"
...
11111 -> "Z"

The table leaves out some letters that could be mixed up (e.g. "L" == "l" could accidentally be read as "1"; omit the letters "Q" and "O" because you use the digit "0"). You need 10 digits plus 22 letters.

If the user enters a code, replace all lower-case letters with upper-case ones and, e.g., "l"/"L" with "1", because these must be typos. If possible, add another one or two characters as a checksum so that you can catch other typos (such as swapping two characters). That way you can display an error message on the front end without doing any decoding or database lookup.
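The two steps above could be sketched roughly as follows in Python. This is only an illustration under some assumptions: the function names `encode_code` and `normalize` are made up, the 32-symbol alphabet drops "I" in addition to the "L", "O" and "Q" named above (one more letter has to go to reach 22), and the output length is fixed at 11 symbols for both input formats so that it does not give away the original length. Truncating the digest does not guarantee uniqueness.

```python
import hashlib

# Assumed alphabet: 10 digits + 22 letters, omitting I, L, O and Q.
ALPHABET = "0123456789ABCDEFGHJKMNPRSTUVWXYZ"


def encode_code(code: str, length: int = 11) -> str:
    """Hash the code with SHA-1 and render the first `length` 5-bit groups."""
    digest = hashlib.sha1(code.upper().encode("ascii")).digest()
    bits = int.from_bytes(digest, "big")  # 160 bits in total
    symbols = []
    for i in range(length):
        shift = 160 - 5 * (i + 1)  # walk the digest from the most significant bits
        symbols.append(ALPHABET[(bits >> shift) & 0b11111])
    return "".join(symbols)


def normalize(user_input: str) -> str:
    """Upper-case the input and fold look-alike letters onto digits."""
    # Folding I and O as well as L is an assumption that matches ALPHABET above.
    return user_input.strip().upper().translate(str.maketrans("LIO", "110"))
```

Both `encode_code("AA12345A1")` and `encode_code("AA12345A123")` return 11 symbols, and a typed-in code would be passed through `normalize` before being compared.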

ur
But hashes are by their very nature not guaranteed to be unique. On the other hand, a SHA-1 is longer than 11 characters (even not considering the restricted values) so it may actually hash to unique values for all given inputs.
Konrad Rudolph
@Konrad: Yes, it's not guaranteed to be unique. But you can calculate the probability and if it's okay for the application it can be a solution which makes it hard to "guess" one code when knowing another.
ur
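For a rough feel of the probability mentioned in the comments above, the usual birthday approximation p ≈ 1 - exp(-n(n-1)/(2N)) can be evaluated directly. This is a sketch that assumes the truncated hash behaves like a uniform random value; the function name and its parameters are hypothetical.

```python
import math


def collision_probability(n_codes: int, symbols: int = 11, bits_per_symbol: int = 5) -> float:
    """Approximate chance that at least two of n_codes share an output."""
    space = 2 ** (symbols * bits_per_symbol)  # number of possible outputs
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(-n_codes * (n_codes - 1) / (2 * space))
```

With 11 symbols of 5 bits each there are 2^55 possible outputs, so a million issued codes gives a collision probability on the order of 1e-5; with only 9 symbols it rises to roughly 1%.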
+3  A: 

Here's one possibility.

For nine-character codes, insert a random even letter (B,D,F,...) after the first digit and two random digits at the end.

For eleven-character codes, insert a random odd letter (A,C,E,...) after the first digit and leave the rest as is.

In both cases, you could also ROT-13 the non-noise alphas and ROT-5 the digits to further change the codes, though I'm not sure that's necessary for your purposes.

That way you end up with a twelve character code for both cases, which you can reverse if need be. It's human-readable and unique. It's not, as you say, NSA-level crypto but it should hold off the casual onlooker.
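A minimal Python sketch of the random-noise scheme described above, without the optional ROT-13/ROT-5 step. The names `obscure` and `recover` are made up for illustration.

```python
import secrets

EVEN_LETTERS = "BDFHJLNPRTVXZ"  # marks an original nine-character code
ODD_LETTERS = "ACEGIKMOQSUWY"   # marks an original eleven-character code


def obscure(code: str) -> str:
    """Insert a marker letter after the first digit; pad short codes to 12 chars."""
    if len(code) == 9:
        marker = secrets.choice(EVEN_LETTERS)
        padding = f"{secrets.randbelow(100):02d}"  # two random noise digits
    elif len(code) == 11:
        marker = secrets.choice(ODD_LETTERS)
        padding = ""
    else:
        raise ValueError("expected a 9- or 11-character code")
    # The first digit sits at index 2, so the marker goes in at index 3.
    return code[:3] + marker + code[3:] + padding


def recover(obscured: str) -> str:
    """Strip the marker and, for nine-character originals, the noise digits."""
    body = obscured[:3] + obscured[4:]
    return body[:9] if obscured[3] in EVEN_LETTERS else body
```

Because the marker and padding are random, calling `obscure` twice on the same code generally gives different results, while `recover` returns the original either way.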


If you need a hash that generates a deterministic result (i.e., no random numbers), you can make the added stuff dependent on the input data. Here's one way; there are probably hundreds more. Consider the two input types:

AB12345C6
AB12345C678

Still insert a character after the 1 in both cases but make it dependent on the input. Add up the digits at positions 1, 3 and 6 and take the modulo-10 of that to get 0 through 9.

Use that as a lookup into the string "ABXVRWECPU" for a nine-character code or "OIYJTQLSDK" for an eleven-character code to get the character. You can then use that character in the resulting code to decide whether it was a nine or eleven-character code initially (the truly paranoid would ensure those strings are not stored in plaintext in the code).

For the two digits to add to the first case, add up the ASCII codes for A, C and a function of B (for example, xor 'B' with 147), then add that to the numbers formed from 64, 51 and 23.

Take the modulo-87 of that then add 7 to get a value between 7 and 93.
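One possible reading of the deterministic variant, sketched in Python. Several details are assumptions: "positions 1, 3 and 6" is taken to mean the first, third and sixth digits of the code, and "the numbers formed from 64, 51 and 23" is taken to mean two-digit numbers built from the 6th and 4th, 5th and 1st, and 2nd and 3rd digits. The function name is hypothetical.

```python
import re

NINE_MARKERS = "ABXVRWECPU"    # lookup string for nine-character codes
ELEVEN_MARKERS = "OIYJTQLSDK"  # lookup string for eleven-character codes


def obscure_deterministic(code: str) -> str:
    """Build a 12-character code whose added characters depend only on the input."""
    code = code.upper()
    if not re.fullmatch(r"[A-Z]{2}[0-9]{5}[A-Z](?:[0-9]|[0-9]{3})", code):
        raise ValueError("expected a 9- or 11-character code")
    digits = [int(ch) for ch in code if ch.isdigit()]
    first, second, third = (ch for ch in code if ch.isalpha())
    # Digits 1, 3 and 6 (1-based) pick the marker letter.
    index = (digits[0] + digits[2] + digits[5]) % 10
    if len(code) == 11:
        return code[:3] + ELEVEN_MARKERS[index] + code[3:]
    # Nine-character code: derive the two padding digits from the input.
    total = ord(first) + ord(third) + (ord(second) ^ 147)
    total += digits[5] * 10 + digits[3]  # "64" in the example
    total += digits[4] * 10 + digits[0]  # "51"
    total += digits[1] * 10 + digits[2]  # "23"
    padding = f"{total % 87 + 7:02d}"    # always in the range 07..93
    return code[:3] + NINE_MARKERS[index] + code[3:] + padding
```

Under these assumptions, `obscure_deterministic("AB12345C6")` gives `"AB1A2345C651"` and `obscure_deterministic("AB12345C678")` gives `"AB1O2345C678"`.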

paxdiablo
Great minds think alike... :(
Vinko Vrsalovic
I agree. Only the padding digits must not be random, but some hash (e.g. sum % 100). Furthermore, even and odd letters may be easily discovered.
Sjoerd
@Sjoerd: What's wrong with random numbers at the end?
Vinko Vrsalovic
A hash function always gives the same result if the input is the same, by definition. When you use random numbers, the hash may be different if you run it a second time.
Sjoerd
@Sjoerd, why not random? They're simply ignored anyway based on the inserted alpha, so it doesn't matter what they are. If you mean you want to ensure the same source gives the same result, I understand, but the letter couldn't be random then either. I'll add another more deterministic hash.
paxdiablo
@Sjoerd I agree about the even/odd. I don't see the problem with random numbers; you won't get a clash because of the added letter, where you halve the alphabet for 9 vs 11 codes
Andrew Bullock
@sjoerd: From the question, it looks to me like true hashing is not really a hard requirement. @paxdiablo: While you're at it, fix the even/odd letters problem (a bit too telling). Additionally, to have it fully deterministic you'd have to deduce the starting letter from the actual characters in the string (maybe using something like a CRC-128 would be enough)
Vinko Vrsalovic
+2  A: 

A very simple way to obfuscate would be to:

  • Secretly and randomly pick half the alphabet to mean 9 chars, and the other half to mean 11 chars.
  • Prepend one random letter from the proper half to the string (i.e., if it's a 9 chars string prepend a letter from the 9 chars half)
  • If the string is 9 chars, append 2 random digits

Then, on use, you know that if the first char is from the half meaning 9 chars, you can discard the final two digits.

You'd end up with 12 chars for every string though.
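A short Python sketch of the steps above. The particular split of the alphabet shown here is an arbitrary example; in practice it would be chosen once, at random, and kept secret. The function names are hypothetical.

```python
import secrets
import string

# Example split only: 13 letters mean "nine chars", the other 13 mean "eleven chars".
NINE_HALF = "QWRTPSDFGHJZX"
ELEVEN_HALF = "".join(c for c in string.ascii_uppercase if c not in NINE_HALF)


def obscure_prefixed(code: str) -> str:
    """Prepend a letter from the matching half; pad nine-character codes with two digits."""
    if len(code) == 9:
        return secrets.choice(NINE_HALF) + code + f"{secrets.randbelow(100):02d}"
    if len(code) == 11:
        return secrets.choice(ELEVEN_HALF) + code
    raise ValueError("expected a 9- or 11-character code")


def recover_prefixed(obscured: str) -> str:
    """Use the first letter to decide whether the last two digits are noise."""
    body = obscured[1:]
    return body[:9] if obscured[0] in NINE_HALF else body
```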

Vinko Vrsalovic