views:

1668

answers:

11

I did this to test the randomness of randint:

>>> from random import randint
>>>
>>> uniques = []
>>> for i in range(4500):  # You can see I was optimistic.
...     x = randint(500, 5000)
...     if x in uniques:
...         raise Exception('We duped ' + str(x) + ' at iteration number ' + str(i))
...     uniques.append(x)
...
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
Exception: We duped 4061 at iteration number 67

I tried about 10 times more and the best result I got was 121 iterations before a repeater. Is this the best sort of result you can get from the standard library?

+34  A: 

Before blaming Python, you should really brush up on some probability and statistics theory. Start by reading about the birthday paradox.

By the way, the random module in Python uses the Mersenne twister PRNG, which is considered very good, has an enormous period and was extensively tested. So rest assured you're in good hands.

Eli Bendersky
How good of hands? I did the same test above with randrange(5000) and got to 13 iterations on the first try, and only after 8 tries did I get past 100. The odds of getting that on the 13th try are technically 1/384, but I got lucky.
orokusaki
+19  A: 

If you don't want repeated values, generate a sequential array and use random.shuffle.
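A minimal sketch of that approach, assuming the goal is every integer from 500 to 5000 exactly once in random order. The variable names are only for illustration, and `random.sample` is shown as an alternative for when just a handful of unique values is needed.

```python
import random

# Every candidate value appears exactly once, so no draw can repeat.
values = list(range(500, 5001))
random.shuffle(values)          # shuffles the list in place

first_ten = values[:10]         # consume the values in shuffled order

# Alternative: draw k unique values directly from the range.
sample = random.sample(range(500, 5001), 10)

print(first_ten)
print(sample)
```

Once the shuffled list is exhausted there are no more values to hand out, which is the price of guaranteeing uniqueness.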

S.Mark
+1 - this will actually achieve what the OP appears to want.
ConcernedOfTunbridgeWells
God I love `random.shuffle`. It's one of the cores of my project :)
Az
+2  A: 

Python's random implementation is actually quite state of the art:

bayer
+89  A: 

The Birthday Paradox, or why PRNGs produce duplicates more often than you might think.


There are a couple of issues at play in the OP's problem. One is the birthday paradox, as mentioned above, and the second is the nature of what you are generating, which does not inherently guarantee that a given number will not be repeated.

The Birthday Paradox applies where a given value can occur more than once during the period of the generator, and therefore duplicates can happen within a sample of values. The effect of the Birthday Paradox is that the real likelihood of getting such duplicates is quite significant and the average period between them is smaller than one might otherwise have thought. This dissonance between the perceived and actual probabilities makes the Birthday Paradox a good example of a cognitive bias, where a naive intuitive estimate is likely to be wildly wrong.

A quick primer on Pseudo Random Number Generators (PRNGs)

The first part of your problem is that you are taking the exposed value of a random number generator and converting it to a much smaller number, so the space of possible values is smaller. Although some pseudo-random number generators do not repeat values during their period, this transformation changes the domain to a much smaller one. The smaller domain invalidates the invariant, so you can expect a significant likelihood of repeats.

Some algorithms, such as the linear congruential PRNG (A' = (A * X) mod M), can guarantee uniqueness for the entire period when the whole of the accumulator state is exposed in the random number generated (i.e. there is no additional state held in the PRNG). In this case, a number cannot repeat within the period, as a given value can only imply one possible successive value: the value produced is solely a function of the previous value. Therefore each value can only occur once within the period of the generator. However, the period of such a PRNG is relatively small (about 2^30 for typical implementations of the Linear Congruential algorithm) and cannot possibly be larger than the number of distinct values that can be generated.
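To make the last two paragraphs concrete, here is a toy multiplicative congruential generator. The constants (modulus 31, multiplier 3) are chosen purely for illustration and are far too small for real use, and the function name `lcg_values` is hypothetical. Because each output is the entire state, no value repeats within the period of 30, but folding the output into a smaller range brings repeats back straight away.

```python
def lcg_values(seed, multiplier=3, modulus=31, count=30):
    """Yield `count` outputs of the toy generator A' = (A * X) mod M."""
    state = seed
    for _ in range(count):
        state = (state * multiplier) % modulus
        yield state

full = list(lcg_values(seed=1))
# The whole state is exposed, so every output within one period is distinct.
print(len(full), len(set(full)))

# Reducing each output to a smaller range (here 0..4) discards state,
# so the reduced values repeat well within the period.
reduced = [v % 5 for v in full]
print(len(reduced), len(set(reduced)))
```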

In the OP's problem the Mersenne Twister algorithm (used in Python's random module) has a very long period (much greater than 2^32) and thus does not provide a guarantee that values (which are returned as 32 bit ints) will not be repeated during this period. Unlike a Linear Congruential PRNG, the number produced is not purely a function of the previous value; the accumulator contains additional state that is used in generating the next number.

The Mersenne Twister is a popular algorithm for PRNGs because it has good statistical and geometric properties and a very long period. However, the MT algorithm is not cryptographically secure; it is relatively easy to infer the internal state of the generator by observing a sequence of numbers.

  • Good statistical properties means that the numbers generated by the algorithm are evenly distributed with no numbers having significantly higher probabilities of appearing than others.

  • Good geometric properties means that sets of n numbers do not lie on a hyperplane in n dimensional space. A PRNG with poor geometric properties can generate spurious correlations in simulation models, which can distort the results.

  • Long period means that you can generate a lot of numbers before the sequence generated wraps around to the start, which is also a desirable attribute for large simulation models.

The period of the MT19937 algorithm is 2^19937 - 1, so a 32 bit integer produced by the PRNG cannot possibly represent enough discrete values for it not to repeat during the period. In this case repeating is inevitable and the birthday paradox applies.

Other algorithms such as Blum Blum Shub are used for cryptographic applications, but may be unsuitable for simulation or general random number applications. Cryptographically secure PRNGs may be expensive (perhaps requiring bignum calculations) or may not have good geometric properties. In the case of this type of algorithm the primary requirement is that it should be computationally infeasible to infer the internal state of the generator by observing a sequence of values.

The Birthday Paradox in a nutshell

This problem is originally defined as the probability of any two people in the room sharing the same birthday. The key point here is that any two people in the room could share a birthday. People tend to naively misinterpret the problem as the probability of someone in the room sharing a birthday with a specific individual, which is the source of the cognitive bias that often causes people to underestimate the probability. This is the incorrect assumption - there is no requirement for the match to be to a specific individual and any two individuals could match.

This graph shows the probability of a shared birthday as the number of people in the room increases. For 23 people the probability of two sharing a birthday is just over 50%.

The probability of a match occurring between any two individuals is much higher than the probability of a match to a specific individual, as the match does not have to be to a specific date. Rather, you only have to find two individuals that share the same birthday. From this graph (which can be found on the Wikipedia page on the subject), we can see that we only need 23 people in the room for there to be a 50% chance of finding two that match in this way.
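The gap between the two readings of the problem can be checked with a short calculation. This is only a sketch (Python 3), assuming 365 equally likely birthdays and ignoring leap years; the function names are made up for the example.

```python
def p_any_pair_shares(people, days=365):
    """Probability that at least two of `people` share a birthday."""
    p_all_distinct = 1.0
    for k in range(people):
        # Person k must avoid the k birthdays already taken.
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct

def p_someone_matches_you(people, days=365):
    """Probability that at least one of the others shares one specific person's birthday."""
    return 1.0 - ((days - 1) / days) ** (people - 1)

print(p_any_pair_shares(23))       # just over 0.5
print(p_someone_matches_you(23))   # far smaller
```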

From the Wikipedia entry on the subject we can get a nice summary. In the OP's problem we have about 4,500 possible 'birthdays' (strictly 4,501, since randint(500, 5000) includes both endpoints), rather than 365. For a given number of random values generated (equating to 'people') we want to know the probability of any two identical values appearing within the sequence.

Computing the likely effect of the Birthday Paradox on the OP's problem

For a sequence of 100 numbers, we have (100 * 99) / 2 = 4950 pairs (see Understanding the Problem) that could potentially match (i.e. the first could match with the second, third etc., the second could match the third, fourth etc. and so on), so the number of combinations that could potentially match is rather more than just 100.

From Calculating the Probability we get an expression of 1 - (4500! / (4500**100 * (4500 - 100)!)). The following snippet of Python code does a naive evaluation of the probability of a matching pair occurring.

# === birthday.py ===========================================
#
from math import log10, factorial

PV = 4500          # Number of possible values
SS = 100           # Sample size

# These intermediate results are exceedingly large numbers;
# Python automatically starts using bignums behind the scenes.
#
numerator = factorial(PV)
denominator = (PV ** SS) * factorial(PV - SS)

# Now we need to get from bignums to floats without intermediate
# values too large to cast into a double.  Taking the logs and
# subtracting them is equivalent to division.
#
log_prob_no_pair = log10(numerator) - log10(denominator)

# We've just calculated the log of the probability that *NO*
# matching pair occurs in the sample.  The probability
# of at least one collision is 1.0 - the probability that no
# matching pairs exist.
#
print(1.0 - (10 ** log_prob_no_pair))

This produces a sensible-looking result of p=0.669 for a match occurring within 100 numbers sampled from a population of 4500 possible values (maybe someone could verify this and post a comment if it's wrong). From this we can see that the lengths of runs between matching numbers observed by the OP seem to be quite reasonable.
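One way to sanity-check that figure is to repeat the OP's experiment many times and count how often a duplicate shows up within the first 100 draws. The sketch below (Python 3) does that; the trial count and seed are arbitrary choices, the function names are hypothetical, and the estimate will wobble by a percent or so around the calculated value.

```python
import random

def dupe_within(sample_size, rng, low=500, high=5000):
    """Return True if `sample_size` draws from randint(low, high) contain a repeat."""
    seen = set()
    for _ in range(sample_size):
        x = rng.randint(low, high)
        if x in seen:
            return True
        seen.add(x)
    return False

def estimate(trials=2000, sample_size=100, seed=12345):
    """Fraction of trials in which a repeat occurred within `sample_size` draws."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    hits = sum(dupe_within(sample_size, rng) for _ in range(trials))
    return hits / trials

print(estimate())
```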

Footnote: using shuffling to get a unique sequence of pseudo-random numbers

See this answer below from S. Mark for a means of getting a guaranteed unique set of random numbers. The technique the poster refers to takes an array of numbers (which you supply, so you can make them unique) and shuffles them into random order. Drawing the numbers in sequence from the shuffled array will give you a sequence of pseudo-random numbers that are guaranteed not to repeat.

ConcernedOfTunbridgeWells
@ConcernedOfTunbridgeW Thanks for not getting mad at me for asking a question. Great answer too.
orokusaki
Wow answer. Where do I send the check ?
e-satis
One correction: LCG-based PRNGs, used properly, do *not* guarantee unique output for the complete cycle. For example, the traditional Turbo Pascal LCG has (IIRC) 31 bits of internal state, but it only generates 15-bit numbers which can and do repeat within a single cycle.
Porculus
@e-satis unless you have a particular distaste for wikipedia politics, much of the reference material for this article came from there. You might donate $2.56 to them for their help. The inline images of the mathematical expressions were done by mathurl.com, although I'm not sure whether the owner of the site takes donations.
ConcernedOfTunbridgeWells
Excellent Reply ! ...WOW !
Arkapravo
+11  A: 

True randomness definitely includes repetition of values before the whole set of possible values is exhausted. It would not be random otherwise, as you would be able to predict for how long a value would not be repeated.

If you ever rolled dice, you surely got 3 sixes in a row quite often...

Ber
+1  A: 

You have defined a random space of 4501 values (500-5000), and you are iterating 4500 times. You are basically guaranteed to get a collision in the test you wrote.

To think about it another way:

  • When the result array is empty P(dupe) = 0
  • 1 value in Array P(dupe) = 1/4500
  • 2 values in Array P(dupe) = 2/4500
  • etc.

So by the time you get to 45/4500, that insert has a 1% chance of being a duplicate, and that probability keeps increasing with each subsequent insert.
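Note that the 1% above is the chance for that single insert. The chance that a duplicate has already turned up somewhere along the way is a different and much larger number. A rough sketch (Python 3) of the difference, assuming 4,501 equally likely values as stated at the top of this answer; the function names are made up for the example.

```python
POSSIBLE = 4501   # randint(500, 5000) includes both endpoints

def p_next_is_dupe(already_seen, possible=POSSIBLE):
    """Chance that the next draw repeats one of `already_seen` distinct values."""
    return already_seen / possible

def p_dupe_so_far(draws, possible=POSSIBLE):
    """Chance that at least one repeat has occurred within `draws` draws."""
    p_all_distinct = 1.0
    for k in range(draws):
        # Draw k must avoid the k distinct values seen before it.
        p_all_distinct *= 1.0 - k / possible
    return 1.0 - p_all_distinct

print(p_next_is_dupe(45))   # about 0.01
print(p_dupe_so_far(46))    # about 0.2
```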

To create a test that truly tests the abilities of the random function, increase the universe of possible random values (e.g. 500-500000). You may or may not get a dupe, but you'll get far more iterations on average.

sfrench
Your math is incorrect because of the birthday problem. See other answers. After 45 inserts, you have a 1% chance of having repeated the first insert, but you also have 44 other distinct inserts that you might have repeated.
jcdyer
+4  A: 

You are generating 4500 random numbers from a range 500 <= x <= 5000. You then check to see for each number whether it has been generated before. The birthday problem tells us what the probability is for two of those numbers to match given n tries out of a range d.

You can also invert the formula to calculate how many numbers you have to generate until the chance of generating a duplicate is more than 50%. In this case you have a >50% chance of finding a duplicate number after 79 iterations.
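Rather than inverting the formula algebraically, the threshold can also be found by stepping through the exact product until the chance of a repeat passes the target. This is a sketch (Python 3) with a hypothetical function name, assuming 4,501 equally likely values. Counting draws from one, it comes out at 80 draws, which is index 79 in the question's zero-based loop.

```python
def draws_for_probability(target=0.5, possible=4501):
    """Smallest number of draws whose chance of containing a repeat exceeds `target`."""
    p_all_distinct = 1.0
    draws = 0
    while 1.0 - p_all_distinct <= target:
        # The next draw must avoid the `draws` distinct values seen so far.
        p_all_distinct *= 1.0 - draws / possible
        draws += 1
    return draws

print(draws_for_probability())
print(draws_for_probability(possible=365))   # the classic birthday case
```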

liwp
+3  A: 

That's not a repeater. A repeater is when you repeat the same sequence. Not just one number.

Lennart Regebro
+9  A: 

alt text

Nimbuz
At last, an explanation I can understand :)
extraneon
A: 

For anyone else with this problem, I used uuid.uuid4() and it works like a charm.
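For what it's worth, a version-4 UUID is still a random draw, just from a much larger space (122 random bits out of 128), so repeats are extremely unlikely rather than impossible. A rough sketch (Python 3) using the usual birthday approximation; the function name is hypothetical and the approximation only holds while the result is small.

```python
import uuid

RANDOM_BITS = 122   # a version-4 UUID has 128 bits, 6 of them fixed

def approx_collision_probability(n, bits=RANDOM_BITS):
    """Birthday approximation n(n-1)/(2d) for n draws from d = 2**bits values."""
    return n * (n - 1) / 2 / 2 ** bits

u = uuid.uuid4()
print(u, u.version)
print(approx_collision_probability(10 ** 9))   # a billion UUIDs
```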

orokusaki
The question then might have been better phrased as "I want to generate a series of non-repeating numbers, Python's randint() doesn't seem to do that - what does?" rather than "Python's random number generator is bad" :-) Assuming uuid4() is truly random, it may still repeat - just really unlikely. What are the actual properties you want from the numbers? Non-repeating? Random? (Pick one.) Not-repeating-often? (Use a bigger int range, effectively all uuid4 is, it seems.) What exactly do you want to use the numbers _for_ is the real question.
agnoster
@agnoster I really didn't intend on insulting Python, but Random: Lack of predictability, without any systematic pattern, and Repeating Pattern: A pattern of a group of items that repeats over and over. See, the random generator is not random if it repeats because it then has a pattern.
orokusaki
@orokusaki Your definition of "random" is wrong. Seriously, go back and re-read the bits on the birthday paradox. A truly random number generator will still have repeats much more frequently than you expect by intuition. As @ConcernedOfTunbridgeW points out, the probability of getting a repeat in the range 500-5000 within the first 100 numbers is ~66%, not at all inconsistent with what you observed, I believe. Randomness does *not* mean "without repeats", it just means... well, random. In fact, if you guarantee a lack of repeats the generator must be *less* random in order to enforce that.
agnoster
@orokusaki The question about what you want these numbers for still stands. If you specifically want non-repeating numbers, why? uuid4() is (if it's truly random) no different from randint() with a very very large range. If you want the sequence to be hard to guess, eliminating repeats actually hurts you, because once I see the number, say, 33, I know that whatever comes next *doesn't* have 33 in it. So enforcing non-repetition actually makes your sequence *more* predictable - do you see?
agnoster
+3  A: 

As an answer to the answer of Nimbuz:

http://xkcd.com/221/

Curd
RFC 1149.5 specifies 4 as the standard IEEE-vetted random number.
Zano
@Curd : I bet a non-programmer will never get this one !
Arkapravo
@Arkapravo - Ridiculous...
orokusaki