views:

774

answers:

9

Hiya All,

I am trying to come up with a "smart" and "secure" way of generating about 63million unique codes to be used in a competition. The codes will be about 10 characters long.

Has anyone done anything similar or aware of any "hurdles" that might arise from this issues? How do we minimize the probability of someone being able to guess the codes?

This project will be done in PHP, but that won't really matter, it's more of the logic that's an issue here.

Any feedback would be really appreciated.

UPDATE Just to clarify it will be 10 characters of case insensitive Alpha Numeric Values. A-Z0-9

A: 

Use a secure random number generator.

mP
Probability of guessing: 63*10^6 / 10^10 = 1 / 159
kgiannakakis
i dunno how someone +1 on this as it's a little unhelpful if not somewhat patronising :)
Shadi Almosri
I don't think anybody said the characters must be numbers. [0-9a-z] gives 63*10^6 / 36^10 = 17/1000000000. [0-9a-Z] gives 63*10^6 / 62^10 = 75/1000000000000.
David Johnstone
Alpha Numeric should be fine. but i think case insensitive would be best.
Shadi Almosri
@David Johnstone: The answer only refers to random number generator, which can only give numeric results. Alpha-numeric codes is of course a solution, but you need something more than a random number generator.
kgiannakakis
But random number generators are really random bit string generators, and these can be displayed any way you want - base 2, base 8, base 10, base 16, base 36...
David Johnstone
Yes, but that's not part of the answer and it shouldn't be considered obvious.
kgiannakakis
@ShadiA secure random number generator satisfies your requirements. Use it to generate your magic numbers and keep track of uniqueness. Its a very simple problem.
mP
Doesn't need to be secure, in fact. There's one code per product sold, with no discernable order. Therefore it's practically infeasible to obtain a sequence. Without a sequence of observations, all RNGs are secure. Even http://xkcd.com/221/
MSalters
A: 

if they are for unique participants, you can hash each participants name (and/or) details and then cut off till the first 10 characters.

OrangeRind
Hashes, technically, may not always be unique though.
Jonathan Sampson
The codes will be distributed on products, and then claimed, so there is no connection with a participant to create the values.
Shadi Almosri
@Shadi - oh. my apologies then.@Jonathan - sure, thats why I asked him to add more details into making the hash to make the input string long enough. although I don't know how high would the probability of clashing in 63m would be.But still Other answers are much better. I also didn't know UUIDs.:)
OrangeRind
A: 

Maybe this will help you a little: Universal Unique Identifier

The intent of UUIDs is to enable distributed systems to uniquely identify information without significant central coordination. Thus, anyone can create a UUID and use it to identify something with reasonable confidence that the identifier will never be unintentionally used by anyone for anything else.

Daff
repair the linkyou forgot to add the last 'r':)
OrangeRind
In UUID v4 this is just a waste of entropy — you are throwing most of data away. In other UUID versions, the data may be not so random (NIC MAC address and system time — or hashes of them — are not really good random data sources).
drdaeman
Ups thanks... damn Copy ) Yeah I know that it is actually pretty hard for a machine to generate a big amount of random numbers. The best for that big of an amount of really random data would probably be to get some measurement data from a physical experiment (radioactive decay) cause these are supposed to be naturally random (at least more than a machine generator).
Daff
+2  A: 

See this link for generating alpha-numeric strings in PHP. It uses an alphabet of 36 characters, which should be secure enough. However uniqueness isn't guaranteed. I guess you could use a Set to implement this. Since this is an one-time operation only, the time delay for testing for duplicates shouldn't be a big deal.

kgiannakakis
Yes you're right, as the data is being setup it's not too hard to check if a code exists when placing it in the DB, and discarding if it doesn't.
Shadi Almosri
It's faster to generate 64 million codes, sort, and check for duplicates. Only O(N log N).
MSalters
@MSlaters: How would you do the duplicate check?
Shadi Almosri
I think this is the best solution so far, modifying the function to only include characters that aren't confusing to the user.
Shadi Almosri
You will compare the next value with the previous one.
kgiannakakis
+5  A: 

Generate a set of truly random, unique 64-bit numbers over the range 0 - 250-1. You'll need to keep track of the ones you have seen and reject duplicates. Use each 5 bits of the lower 50 bits of this number pull from a 32-character alphabet -- basically all the letters in the English alphabet (upper or lowercase) minus L and O plus the digits 2-9 (this reduces confusion between l/1 and 0/O). For 63 million codes, this would give you a 0.000006% probability (63,000,000/250) of choosing a valid code sequence randomly.

I've also done this using an autogenerated, primary key (int) and bit-interleaving it with a 32-bit random value. In this case I used the full 64-bits to generate 13 characters from the alphabet and added two random characters at fixed positions for a 15-character code. When redeeming the code you reverse the algorithm to extract the key and the randomness, throwing away the two extra random characters, then compare the randomness to that found stored with the key to validate the code.

tvanfosson
+9  A: 

If I understand you correctly, you want to create 63 milion codes of 10 digits that have a low "guess factor".

There are 10,000,000,000 valid combinations. Of these 63,000,000 are price numbers. 63 / 10,000 = 0.0063. So each guess has a 0,63% chance of success. Does not sound high, but with brute force, the numbers are quite easy to obtain.

Are you sure the 63 on 10,000 ratio is good enough?

Gamecat
This is a good point.
Mark Cooper
A good reason not to use decimal digits; see kgiannakakis
MSalters
He specifically said characters, not digits. A good warning, but I think he already gets that.
tvanfosson
He said in response to another answer that alpha-numeric would most likely be used.
David Johnstone
They will be Alpha Numeric...
Shadi Almosri
Also in place will be a form of anti brute force protection....
Shadi Almosri
The alpha part was not clear at first. Now the guess chance is 0.0000017%. Looks a lot better to me ;-).
Gamecat
A: 

You say the codes are 10 'characters' long, but what is your character set?

If it's just digits, then (@Gamecat) it'll probably be a little too easy to randomly guess a code.

On the other hand, if the character set is letters + digits, then you've got plenty of safety.

In any case generate using secure random number generator, and check for duplicates before putting in a database.

Douglas Leeder
+22  A: 

Syntax:

You probably will have people copying these codes, so that means these codes should be easy to copy. 10^10 is too small, as Gamecat points out. kgiannakakis has a better idea, but that causes another problem: "1" looks a lot like "I". "0", "C", "O" and "Q" are also quite similar. This is not a big problem. Define a safe alfabet: "0123456789ABDEFGHJKLMNPRSTUVXYZ" (leaves out COIQ) From the comments: depending on the fonts you pick, 5/S and U/V may also be visually ambiguous; replace as required. This is a 32 symbol (5 bit) code. A 10 character code is a 50 bits number. Those should be fairly trivial to generate, sort, copy, compare etc. Chances of being guessed are about 0.63E-7

Since the codes are too long to remember, users will need a resting point when copying them. So split the string in two or three parts and make sure the input field matches this breakdown.

E.g. AKG3L-45TEE => two groups of 5, and even if you can't remember 5 chars it's a lot easier to find the point back where you stopped reading.


How to generate them:

This is fairly simple. You don't need a particularly sophistciated algorithm to generate candidates. You can generate 10 random numbers per code needed, take 5 bits from each number (typically the middle bits are best, e.g. (rand()/64) modulo 32). Use this value [0-31] as index into your alphabet. Create a database table with this string as a primary key, and insert until the table has 63 million entries. You probably want to add "generated on" and "redeemed on" dates to this table.

MSalters
+1 Good point on the removing of confusing values
Shadi Almosri
And having 32 symbols could simplify some things too
David Johnstone
The letter ess 'S' and the number five '5' are also homomorphic.
dwhall
The set of similar characters depends heavily on the font, in some fonts U and V can also be hard to distinguish.I'd also sacrifice a symbol worth of the codespace to error detection, so you can inform the user that he probably mistyped the code. This still leaves plenty of room to make guessing infeasible if some trivial throttling is used.
Ants Aasma
Thanks for this, i think with slight mod using info from other answers on here, this solution is resolved. Appreciated :)) thanks you!
Shadi Almosri
+3  A: 

Be careful when using alphanumerics for codes, as you don't want to accidentally generate anything confusing or embarrassing. To avoid confusion I suggest removing 1 and L, 0 and O, and maybe 8 and B. To avoid embarrassment consider removing all vowels so that you cannot accidentally spell anything (use your imagination here).

Alan