views:

362

answers:

10

I recently posted this question about codes for a gift-card-like voucher that users can redeem online. I wanted to find the best tradeoff between large keyspace, low guessability, and human readability. Now that I'm into implementation I realize I've got another problem altogether, more of an algorithmic challenge.

Let's assume I adopt some code format - say 10 characters from A to Z for simplicity, and I start generating vouchers. What is the correct algorithm to do this?!

My first approach is to number all possible codes from 0 to 308,915,776, then start generating random numbers in that range. This obviously has a big problem though - I have to check my random number against all previously generated voucher codes and if it collides with an existing one I'll have to discard the code and try another. As the system accumulates more data it will slow down. At the extreme when there is only one code left it will be nearly impossible for the system to guess it correctly.

I could pre-generate all codes and shuffle them, then consume them in order. But this means I have to store many codes, and in fact my keyspace is bigger than the one i described, so we're talking about a very large amount of data. So that's also not too desirable.

So this leaves me with using the codes sequentially. I do not want guessable voucher codes though. The user who buys voucher "AAAAAAAAAY" should not have a good chance of getting another valid code if they type in "AAAAAAAAAZ".

I can shuffle my alphabet and my positions so that instead of

'ABCDEFGHIJKLMNOPQRSTUVWXYZ' i use

'LYFZTGKBNDRAPWEOXQHVJSUMIC'

and so that instead of positions

9 8 7 6 5 4 3 2 1 0 the positions are

1 8 0 7 5 4 3 9 2 6

Using this logic, given the code

LNWHDTECMA

the next code would be

LNEHDTECMA

This is definitely way less guessable. But they're still only one character off from each other, and given just two of these vouchers you would know which position is incrementing, and you would have a 90% chance of getting the next code in 24 guesses or less.

My "escape hatch" is to ditch all this and go with GUIDs. They have more characters than I wanted my users to have to type in, and contain similar characters like I/1 and O/0, but they magically make all of the above headaches go away. Still, I'm having fun thinking about this, maybe you are too. I'd love to hear some alternate suggestions. What have you got?

Thanks!

+2  A: 

Hi again, I answered the other question too :P

The best way is to generate one alphanumeric character at a time, randomly, until you have 8 of them. This will then be your voucher.

Ideally the best way would be to choose a sequence long enough so that you can safely assume if there will be any duplicates. Do note that, perhaps counter-intuitively, this happens more often than you think because of the Birthday problem.

For example, with 8 characters you have 1785793904896 possible combinations, but if you generate only 1,573,415 vouchers you will have a 50% chance to have a duplicate.

So, it all depends on how many you want to generate, and the maximum length of the code you're comfortable with. If you are generating many and you want to keep it short, you should save the ones you previously generated, and check against the database for duplicates.

Andreas Bonini
+1 n^k is quite a lot (here: 26^10)
Thomas Jung
Hello again Andreas! I agree that for a sufficiently-sized keyspace the odds start to look pretty good. I guess I just have my computer scientist hat on right now and want a general purpose approach that is guaranteed to work - not just something that is likely to work given the current inputs.
Barry Fandango
That's not completely true about GUIDs at least for v1 algorithm, unless you generate more than (I think) 4 per ms and assuming the machine that's generating it has a device with a valid MAC address you're guaranteed uniqueness due to the timestamp component of the GUID as well as the MAC address component (and then a few random things thrown in). Other versions may not have a strict guarantee and may indeed be close enough to guaranteed by the sheer size of the GUID pool.
Davy8
If there are 1785793904896 possible combinations, then after generating about 1,573,415 at random, without testing for duplicates, the probability of collision reaches 0.5. See http://en.wikipedia.org/wiki/Birthday_problem .
Jason Orendorff
-1: The birthday paradox means that the probability of a collision increases dramatically with the number of used codes and starts approaching 50% somewhere around 1.3 million codes. GUIDs get around this by being just ridiculously long (too long to be practical for this purpose)
Michael Borgwardt
@Davy8: Well, still, take a look here: http://en.wikipedia.org/wiki/Universally_Unique_Identifier#Random_UUID_probability_of_duplicates
Andreas Bonini
Jason/Micheal: you're right! I'll edit my question to reflect this
Andreas Bonini
-1 rescinded :)
Michael Borgwardt
+5  A: 

Some random number generators have an interesting property: Used right they do not generate duplicate numbers in a long time. They produce something called a full cycle. Use one of the algorithms described there, seed it, and you will have many unique numbers,

Add a smart way to map digits to characters and you got your codes.

ebo
Be aware that this - in some specific circumstances - can pose a serious security risk. If an attacker manages to get his hands on many vouchers that he knows have been generated subsequentially he can obtain the codes for all the vouchers by "reverse engineering" (probably not the technical term for this) the pseudo-RNG.
Andreas Bonini
@Andreas - And if you use a salt value: voucher(previousVoucher + salt)?
Thomas Jung
t. jung - could you salt the value without losing the "full cycle" property of the algorithm?
Barry Fandango
@Barry - It would not mess with the RNG if: voucher(previousVoucher, salt) = rng(previousVoucher) + salt. The question is if this makes the approach safer. If you use only the RNG and it and one voucher is known it is broken.
Thomas Jung
@Thomas is that any better than just adding some random junk to the end of the code and then stripping those digits off before checking it?
Martin Beckett
Martin: Yes. Checking the random junk is the important part because that's the part that an attacker, however clever, can't predict.
Jason Orendorff
@Martin - I would say yes. Adding and removing junk can be seen as part of the RNG algorithm. The approach has to work even when the RNG and vouchers are known. So you have to add a secret that is not known and will not be communicated. I would like to hear from someone if the salt approach is dumb idea.
Thomas Jung
Just use a cryptographically secure PRNG (http://en.wikipedia.org/wiki/Cryptographically_secure_pseudorandom_number_generator); don't listen to this salt nonsense. An example of a good generator would be SHA1-HMAC(LongSecretPassphrase, counter). Even our best super-computers have yet to find two SHA1 values that hash to the same thing, so you are unlikely to do it by accident generating a few hundred-thousand gift card codes.
BlueRaja - Danny Pflughoeft
See http://stackoverflow.com/questions/1650185/php-short-unique-id-generation-using-autoincrement/1650945#1650945 for in implementation of the RNG idea I did a while back.
Wim
@BlueRaja that avoids the collision by having a hash space much larger than the gift-card number space. Which is what we are trying to avoid.Yes it's difficult to calculate what to add to a known file to generate a SHA1 collision, but if you simply start hashing consecutive numbers - you are back to the birthday paradox
Martin Beckett
@Martin: As stated by Michael above, you **must** have a keyspace much larger than your gift-card space. As for the birthday paradox, that is a problem with or without a "salt": The birthday paradox states that you need to hash 2^80 SHA1 values before you have a 50% probability of having a collision. That's 1 hash every nanosecond for 33 million years. That's exactly the reason no one has even *found* two values that hash to the same thing.
BlueRaja - Danny Pflughoeft
But Sha1 means you need a 160bit result to code for a 3-4 digit gift card number - the OP wanted a way of having a series of much smaller (easier to type) codes without a collision. Thats why the RNG is a good solution.
Martin Beckett
+5  A: 

The likelihood of two randomly generated code colliding is basically the same as a user guessing a valid code - and you cannot prevent users from guessing. So you must have a key space so much larger than the number of actually used codes that random collisions are extremely unlikely as well (though, thanks to the birthday paradox, probably not unlikely enough to ignore them completely, at least if you want your codes to be reasonably short), and checking against existing codes and re-generating in case of a collision is a perfectly viable strategy.

Michael Borgwardt
+1 for storing and regenerating on collision. To get to a 50% chance of regenerating an individual key with just 10 uppercase letters, you'd have to have (26^10)/2 keys stored which is around 70.5 trillion, so the likelihood of having to regenerate so many times that it becomes infeasible is negligible. 90% of the time (made up number) you're not going to have to regenerate.
Davy8
If you had to regenerate even 10% of the time, it would mean that a random guess has a 10% chance of hitting a valid code - you want that likelihood to be much lower. I'd say that making the code space 1000 times larger than the number of codes you could theoretically expect to need should be OK.
Michael Borgwardt
That will be fun code to test. Once executed in a lifetime and then probably broken.
Thomas Jung
There definitely should be an automated test for that kind of thing...
Michael Borgwardt
+1  A: 

I think the best way to go is that suggested by Andreas. But my answer is about an interesting related discussion.

You want to generate a sequence of numbers that together form a permutation of S = {1, ..., MAX}. One way to do this is to take the elements of a cyclic group over S. For example, the numbers R = {x modulo p, x^2 modulo p, x^3 modulo p, ..., x^(p-1) modulo p} form a cyclic group over {1, ..., p-1}, provided p is a prime and x is coprime to p. So if you choose MAX as a prime number you do use this sequence.

You want a "tough-to-crack" sequence. A generator for the sufficiently-tough-to-crack sequence is called a pseudorandom generator (ofcourse you probably don't need that tough-to-crack). An example is the last digit of elements in R above, provided p is kept secret (am I correct?). But the answer by Andreas already uses a source of (pseudo-) random numbers, so cannot be called a pseudorandom generator.

If you are interested in pseudorandom generators, they are discussed in detail in volume 2 of Knuth's well-known book.

Amit Kumar
A: 

Here is a though:

  • ID = each voucher has an unique (auto-incremented?) ID
  • CHECKSUM = apply N iterations of the Verhoeff or Luhn algorithm on the ID
  • VOUCHER = base convert the generated CHECKSUM from base 10 to base 36

See also this related SO question: Ideas to create a small (<10 digits), not (very) secure “hash”.


One simple way to make this method more secure is to use a non auto-incremented ID value, one option might be to use ID as the last 6 or 7 digits of a UNIX timestamp and calculate the checksum.

Alix Axel
I'm not familiar with those algorithms but i'll go read about them, thanks.
Barry Fandango
So basically you depend on people trying to guess codes to not know N and which algorithm you're using? This is called "security through obscurity", and not good.
Michael Borgwardt
@Michael: Indeed, but he stated he wanted "low guessability" not "no guessability at all".
Alix Axel
@Michael: See my edit in a couple of seconds.
Alix Axel
If you can figure out how to generate valid codes at will, it's not even guessing anymore - the security is simply broken. And all it takes is one person to figure it out (or know it, e.g. a disgruntled employee), and a few days later 10,000 people have downloaded a voucher generator...
Michael Borgwardt
Using timestamps is even worse - no less predictable, but now you're generating duplicates all the time.
Michael Borgwardt
+3  A: 

I would say to use a "perfect hash" - http://en.wikipedia.org/wiki/Perfect_hash_function combined with a 4-digit random number...

So just increment your voucher code each time, then hash it, add a 4 digit random number and I would also add a check digit to the end (as Alix Axel suggested).

This would be very secure with no clashes - for example if someone worked out your hashing algorithm, they would also have to guess the 4-digit code at the end...

Mark
+2  A: 

Programming Pearls has several examples of algorithms to generate sets of random numbers, you should read it if you're interested in this kind of problem.

The book shows that if you generate m random numbers with value less than n, the simple approach of generating numbers and throwing out duplicates will generate no more than 2m random numbers if m < n / 2. Here it is, in C++:

void gensets(int m, int n)
{
    set<int> S;
    set<int>::iterator i;
    while (S.size() < m) {
        int t = bigrand() % n;
        S.insert(t);
    }
    for (i = S.begin(); i != S.end(); ++i)
        cout << *i << "\n";
}

Obviously, if you're worried about people guessing values, you will want m to be much less than n / 2.

There's even a set-based algorithm to generate m random numbers less than n with each value being equally likely, no duplicates, and a guarantee not to generate more than m random numbers:

void genfloyd(int m, int n)
{
    set<int> S;
    set<int>::iterator i;
    for (int j = n-m; j < n; j++) {
        int t = bigrand() % (j+1);
        if (S.find(t) == S.end())
            S.insert(t); // t not in S
        else
            S.insert(j); // t in S
    }
    for (i = S.begin(); i != S.end(); ++i)
        cout << *i << "\n";
}

The order of the numbers isn't random, though, so this is probably not a good choice for you.

RossFabricant
+6  A: 

Use an N-bit serial number R, combined with an M-bit hash H of the concatenated pair (R, S) where S is some secret "salt" S which you do NOT publish. Then encode the pair (R,H) alphanumerically in any reversible way you like. If you like algorithms like MD5* or SHA, but the bit count is too high, then just take the M least significant bits of a standard hash algorithm.

You can verify easily: decode the alphanumeric encoding so you can see R and H. Then compute H' = hash(R+S) and verify that H = H'.

edit: R can be an incrementing serial number or random or whatever, just make sure you use each value not more than once.

*before someone says "MD5 is broken", let me remind you that the known weaknesses for MD5 are collision attacks, and not preimage attacks. Also, by using an unpublished, secret salt value, you deny an attacker the ability to test your security mechanism, unless he/she can guess the salt value. If you feel paranoid, pick two salt values Sprefix and Ssuffix, and calculate the hash of the concatenated triple (Sprefix,R,Ssuffix).

Jason S
But what advantage does a cryptographic hash have over random numbers?
Jason Orendorff
simplicity! pseudorandom numbers used for this purpose may be unguesssable, but they would still need to be unique and verifiable.
Jason S
+1  A: 

This is a summary of the best bits of all the other answers. :)

You need to generate gift card numbers that are:

  • unique
  • unguessable

Random numbers are unguessable but not necessarily unique. The numbers produced by various algorithms are unique but guessable (the algorithm can be reverse-engineered). I don't know of a single algorithm that gives both properties, and because of the need to defy reverse engineering, it falls in the domain of cryptography. Non-experts, of course, shouldn't try to design cryptosystems.

Fortunately you don't have to get both properties from the same algorithm. Your gift card codes can consist of two parts: a part that is unique (generated using a linear congruential generator, perhaps, or modulo arithmetic, or even just an integer that you increment each time) and a part that is unguessable (just random numbers).

Jason Orendorff
You've noted "unique" and "unguessable" but you're missing the property of verifiability. Your solution would work, although it would require a database storage to map each unique ID to a single valid random number.
Jason S
Jason S: You have to store the codes in a database anyway, to record the amount of the gift and the amount already spent.
Jason Orendorff
A: 
Norman Ramsey