Existing answers do a good job of addressing the question's specifics, but I think it's worth mentioning a side issue: why you're particularly likely to want to pass an alternative "random generator" to shuffle, as opposed to other functions in the random module. Quoting the docs:
Note that for even rather small len(x), the total number of permutations of x is larger than the period of most random number generators; this implies that most permutations of a long sequence can never be generated.
The phrase "random number generators" here refers to what may be more pedantically called pseudo-random number generators -- generators that give a good imitation of randomness, but are entirely algorithmic, and therefore are known not to be "really random". Any such algorithmic approach will have a "period" -- it will start repeating itself eventually.
Python's random module uses a particularly good and well-studied pseudo-random generator, the Mersenne Twister, with a period of 2**19937-1 -- a number with more than 6 thousand decimal digits, as len(str(2**19937-1)) will confirm;-). On my laptop I can generate about 5 million such numbers per second:
$ python -mtimeit -s'import random' 'random.random()'
1000000 loops, best of 3: 0.214 usec per loop
Assuming a much faster machine, able to generate a billion such numbers per second, the cycle would take about 10**5985 years to repeat -- and the best current estimate for the age of the Universe is a bit less than 1.5*10**10 years. It would thus take an almost-unimaginable number of Universe-lifetimes to reach the point of repetition;-). Making the computation parallel wouldn't help much, either: there are estimated to be about 10**80 atoms in the Universe, so even if you could run such a billion-per-second generator on every atom in the Universe, it would still take well over 10**5800 Universe-lifetimes to start repeating.
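If you want to sanity-check those figures, a quick back-of-the-envelope computation (assuming the same hypothetical billion numbers per second and an ordinary 365-day year) goes like this:
>>> from math import log10
>>> digits = 19937 * log10(2)                  # log10 of the period: about 6001.6, i.e. 6002 decimal digits
>>> per_year = 9 + log10(60 * 60 * 24 * 365)   # log10 of numbers drawn per year at 10**9 per second
>>> round(digits - per_year)
5985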
So, you might be justified in suspecting that this worry about repetition is a tiny little bit of a theoretical, rather than practical, issue;-).
Nevertheless, factorials (which count the permutations of a sequence of length N) also grow pretty fast. Using natural logarithms (since 2**19937 is too large to fit in a double;-) and Stirling's approximation log(n!) ≈ n*log(n) - n:
>>> from math import log
>>> def stirling(n): return n*log(n)-n
...
>>> 19937*log(2)
13819.275338823629
>>> stirling(2081)
13819.096562725152
>>> stirling(2082)
13826.737406782171
so we see that, by this rough estimate, the Mersenne Twister might just be able to produce all permutations of a sequence of length 2081, but definitely not of one of length 2082 or higher. Were it not for the "lifetime of the Universe" issue, the docs' worry about "even rather small len(x)" would be justified -- we know that many possible permutations can never be reached by shuffling with such a pseudo-RNG as soon as we have a reasonably long sequence, so one might well worry about what kind of bias we're actually introducing with even a few shuffles!-)
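Incidentally, since Python's integers are unbounded, we don't even need the approximation: we can compare the exact factorials against the exact period (a quick check, which shows the true cutoff is actually one notch lower than the rough Stirling estimate suggests):
>>> from math import factorial
>>> period = 2**19937 - 1
>>> factorial(2080) < period
True
>>> factorial(2081) < period
False
So, taking the period as a hard ceiling on the number of distinct shuffle outcomes, 2080 is the longest sequence whose permutations could conceivably all be produced -- the qualitative point is the same either way.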
If you want to side-step the pseudo-RNG entirely, the standard library lets you draw on the operating system's own entropy: os.urandom mediates access to whatever sources of physical randomness the OS provides -- CryptGenRandom on Windows, /dev/urandom on Linux, etc. os.urandom gives sequences of bytes, but with the help of struct it's easy to turn them into random floats in [0.0, 1.0), which is exactly what shuffle expects from its random argument:
>>> import os, struct
>>> n = struct.calcsize('I')
>>> def s2i(s): return struct.unpack('I', s)[0]
...
>>> maxi = float(s2i(b'\xff'*n) + 1)   # one more than the largest n-byte unsigned int
>>> def rnd(): return s2i(os.urandom(n))/maxi
Now we can call random.shuffle(somelist, rnd) and worry less about bias;-).
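For instance, here is a small self-contained sketch (the name rnd and the 52-card deck are just illustrative; note also that shuffle's optional second argument was deprecated in Python 3.9 and removed in 3.11, so this form only works on older versions):
import os, struct, random

n = struct.calcsize('I')
maxi = float(struct.unpack('I', b'\xff' * n)[0] + 1)

def rnd():
    # a float in [0.0, 1.0), drawn from the OS's entropy source
    return struct.unpack('I', os.urandom(n))[0] / maxi

deck = list(range(52))
random.shuffle(deck, rnd)   # the two-argument form: accepted up to Python 3.10
print(deck)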
Unfortunately, measurement shows that this approach to RNG is about 50 times slower than calls to random.random() -- this could be an important practical consideration if we're going to need many random numbers (and if we don't, the worry about possible bias may well be misplaced;-). The os.urandom approach is also hard to use in predictable, repeatable ways (e.g. for testing purposes), whereas with random.random() you need only call random.seed with a fixed value at the start of the test to guarantee reproducible behavior.
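As a concrete sketch of that testing pattern (using a dedicated random.Random instance rather than the module-level seed, but the idea is identical, and the seed value 42 is arbitrary):
import random

def shuffled_copy(seq, seed):
    # seed a private Mersenne Twister, then shuffle a copy of seq
    rng = random.Random(seed)
    out = list(seq)
    rng.shuffle(out)
    return out

# the same seed always yields the same permutation, which is what
# makes a test that depends on shuffling repeatable
assert shuffled_copy(range(10), 42) == shuffled_copy(range(10), 42)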
In practice, therefore, os.urandom is only used when you need "cryptographic quality" random numbers -- ones that a determined attacker can't predict -- and you're willing to pay the practical price of using it instead of random.random.