I'd like to produce fast random shuffles repeatedly with minimal bias.

It's known that the Fisher-Yates shuffle is unbiased as long as the underlying random number generator (RNG) is unbiased.

To shuffle an array a of n elements:
  for i from n − 1 downto 1 do
       j ← random integer with 0 ≤ j ≤ i
       exchange a[j] and a[i]
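
In Python, that pseudocode is roughly the following (random.randrange stands in here for whatever RNG is actually used):

    import random

    def fisher_yates(a, rand_below=random.randrange):
        # rand_below(k) must return an integer in [0, k).
        for i in range(len(a) - 1, 0, -1):
            j = rand_below(i + 1)  # 0 <= j <= i
            a[i], a[j] = a[j], a[i]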

But what if the RNG is biased (but fast)?

Suppose I want to produce many random permutations of an array of 25 elements. If I use the Fisher-Yates algorithm with a biased RNG, my permutations will be biased, but I believe this assumes the 25-element array starts from the same state before each application of the shuffle. One problem, for example, is that if the RNG has a period of only 2^32 ≈ 4×10^9, we cannot produce every possible permutation of the 25 elements, because there are 25! ≈ 1.6×10^25 permutations.

My general question is: if I leave the array in its shuffled state before starting each new application of the Fisher-Yates shuffle, rather than resetting it, would this reduce the bias and/or allow the algorithm to produce every permutation?

My guess is that it would generally produce better results, but it seems that if the number of elements in the repeatedly shuffled array were related to the period of the underlying RNG, the permutations could actually repeat more often than expected.

Does anyone know of any research that addresses this?

As a sub-question, what if I only want repeated permutations of 5 of the 25 elements in the array, so I use the Fisher-Yates algorithm to select 5 elements and stop before doing a full shuffle? (I use the 5 elements at the end of the array that got swapped.) Then I start over, using the previous partially shuffled 25-element array to select another permutation of 5. Again, it seems like this would be better than starting from the original 25-element array if the underlying RNG is biased. Any thoughts on this?
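
Concretely, the partial shuffle I mean looks something like this Python sketch (rand_below is a stand-in for the fast biased RNG, returning an integer in [0, k); the array is deliberately not reset between calls):

    def draw_five(a, rand_below):
        # Run the first 5 steps of Fisher-Yates on a (in place) and
        # return the 5 elements that were swapped to the end.
        n = len(a)
        for i in range(n - 1, n - 6, -1):  # i = 24, 23, 22, 21, 20 when n = 25
            j = rand_below(i + 1)          # 0 <= j <= i
            a[i], a[j] = a[j], a[i]
        return a[n - 5:]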

I think it would be easier to test the partial-shuffle case, since there are only 6,375,600 possible permutations of 5 out of 25 elements. Are there any simple tests to use to check for biases?
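
One simple test along these lines might be a chi-squared goodness-of-fit check over all 6,375,600 cells. A sketch, assuming samples is a list of the 5-element tuples produced by repeated partial shuffles (for the expected count per cell to be reasonable, this needs tens of millions of samples):

    from collections import Counter

    def chi_squared_uniform(samples, cells=6375600):
        # Chi-squared statistic against the uniform distribution over all
        # `cells` possible outcomes; compare the result to a chi-squared
        # distribution with cells - 1 degrees of freedom.
        counts = Counter(samples)
        expected = len(samples) / cells
        stat = sum((c - expected) ** 2 / expected for c in counts.values())
        stat += (cells - len(counts)) * expected  # cells never observed
        return stat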

+2  A: 

A couple of points:

1) Anyone using the Fisher-Yates shuffle should read Jeff Atwood's Coding Horror post on the naïve shuffle and make doubly sure their implementation is correct.
2) Doesn't repeating the shuffle defeat the purpose of using a faster random number generator? Surely if you're going to have to repeat every shuffle 5 times to get the desired entropy, you're better off using a low-bias generator.
3) Do you have a setup where you can test this? If so, start trying things: Jeff's graphs make it clear that you can easily detect quite a lot of errors by using small decks and visually portraying the results.

Daniel
The shuffle listed in the original question is Fisher-Yates and is implemented correctly.
NickLarsen
I found 1) interesting because I thought that most coders already knew how to shuffle correctly. However, I was looking at the sorting test suite for OpenJDK and saw that they were just doing the naive shuffle many times over when shuffling test data. I wondered if they had any reason for it, but I guess not.
Justin Peel
About point 2: I want to use each random shuffle, then simply reuse the shuffled array as the starting state for the next shuffle, not shuffle 2 or more times before using a permutation.
JohnPS
As for Jeff's graphs, using smaller sets does show the differences easily, but for shuffling, say, a deck of cards (N = 52), the number of times you would have to shuffle to show statistically significant differences is pretty astounding. So for those, it's much easier to use proofs.
NickLarsen
I agree that a simple test would be to check that every permutation (say, of 5 out of 25) is produced approximately equally often over a large sample. With a deck of cards, as NickLarsen points out, this is harder to do if we want complete permutations of 52.
JohnPS
@NickLarsen: It depends how biased your RNG is. If it alternates between odd and even outputs, you might be able to detect the bias with overwhelming confidence in a *single* shuffle of a 52-card deck.
Steve Jessop
+1  A: 

I can't completely answer your question, but this observation seemed too long for a comment.

What happens if you ensure that the number of random numbers pulled from your RNG for each iteration of Fisher-Yates has a high least common multiple with the RNG period? That may mean "wasting" a random integer at the end of the algorithm. When shuffling 25 elements, you need 24 random numbers. If you pull one more random number at the end, making 25 random numbers per shuffle, you're no longer guaranteed a repetition at the RNG period. Randomly, the same 25 numbers could still occur in succession before the period is exhausted, of course. But since 25 has no common factor other than 1 with 2^32, you wouldn't hit a guaranteed repetition until 25 * 2^32 draws. That isn't a huge improvement, but you said this RNG is fast. What if the "wasted" amount were much larger? It may still not be practical to reach every permutation, but you could at least increase the number you can reach.
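
In code, the idea might look like the following Python sketch (next_raw is a stand-in for the fast RNG's raw output):

    def shuffle_with_waste(a, next_raw):
        # Standard Fisher-Yates on 25 elements uses len(a) - 1 = 24 draws;
        # pull one extra so each shuffle consumes 25 values. Since
        # gcd(25, 2**32) == 1, the stream phase cycles through all 2**32
        # offsets before a shuffle is guaranteed to repeat.
        for i in range(len(a) - 1, 0, -1):
            j = next_raw() % (i + 1)  # modulo reduction; adds its own small bias
            a[i], a[j] = a[j], a[i]
        next_raw()  # the "wasted" draw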

Andrew
That's an interesting observation. I'm guessing repeated permutations are usually not a problem. I think "shuffling the previously shuffled array" could only lengthen the period between repeats.
JohnPS
+3  A: 

if the RNG has a period of only 2^32 ≈ 4×10^9, we cannot produce every possible permutation of the 25 elements, because there are 25! ≈ 1.6×10^25 permutations

This is only true as long as the seed determines every successive selection. As long as your RNG can be expected to deliver a precisely even distribution over the range specified for each selection, it can produce every permutation. If your RNG cannot do that, having a larger seed base will not help.

As for your side question, you might as well reseed for every draw. However, reseeding is only useful if the new seed contains enough entropy. Timestamps don't contain much entropy, and neither do algorithmic calculations.

I'm not sure what larger solution this is part of, since you haven't described it, but if you are trying to compute something over a larger domain using random input, there are probably better methods.

NickLarsen
No, he's right: if your RNG has a period of ~10^9, it can't possibly generate 25! distinct sequences, and thus not all shuffles are equally probable. I wouldn't expect to see any systematic bias in which shuffles are possible, though, if the PRNG is even remotely good.
Nick Johnson
I never said he was wrong. I pointed out that what he said only holds when the RNG's period is smaller than the total number of permutations and the seed determines every successive number generated. There are a number of alternative RNG methods that do not use seeds or have periods. When you remove the dependency on a seed, my claims are correct.
NickLarsen
Good point. I was only considering RNGs that produce a single sequence with a certain period (say 2^32), where the seed only determines the starting point in that sequence, such as simple linear congruential generators.
JohnPS
+1  A: 

My feeling is that with a biased RNG, repeated runs of the Knuth shuffle would produce all the permutations, but I'm not able to prove it (it depends on the period of the RNG and on how biased it is).

So let's reverse the question: given an algorithm that requires a random input and a biased RNG, is it easier to de-skew the algorithm's output or to de-skew the RNG's output?

Unsurprisingly, the latter is much easier to do (and is of broader interest): there are several standard techniques for it. A simple technique, due to von Neumann, is: given a bitstream from a biased RNG, take the bits in pairs, throw away every (0,0) and (1,1) pair, return 1 for every (1,0) pair and 0 for every (0,1) pair. This technique assumes that the bits come from a stream where each bit has the same probability of being 0 or 1 as any other bit, and that the bits are not correlated. Elias generalized von Neumann's technique to a more efficient scheme (one that discards fewer bits).
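
A minimal sketch of the von Neumann technique in Python (bits is assumed to be an iterable of 0/1 values from the biased source):

    def von_neumann(bits):
        # Take non-overlapping pairs: (1,0) -> 1, (0,1) -> 0,
        # (0,0) and (1,1) are discarded.
        it = iter(bits)
        for first in it:
            second = next(it, None)
            if second is None:
                return
            if first != second:
                yield first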

But even strongly biased or correlated bits may contain useful amounts of randomness, which can be extracted, for example, with a technique based on the Fast Fourier Transform.

Another option is to feed the biased RNG output to a cryptographically strong function, for example a message digest algorithm, and use its output.
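
For example, a sketch of the digest approach using SHA-256 from Python's standard library (the 64-byte input block size is an arbitrary choice here):

    import hashlib

    def whitened_stream(next_raw_byte):
        # Hash 64 biased input bytes at a time and emit the 32 digest
        # bytes; next_raw_byte is a stand-in for the biased RNG.
        while True:
            block = bytes(next_raw_byte() for _ in range(64))
            yield from hashlib.sha256(block).digest()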

For further references on how to de-skew random number generators, I suggest reading the Randomness Recommendations for Security RFC (RFC 1750).

My point is that the quality of the output of a random-based algorithm is upper-bounded by the entropy provided by the RNG: if it is extremely biased, the output will be extremely biased, no matter what you do. The algorithm can't squeeze out more entropy than is contained in the biased random bitstream. Worse: it will probably lose some random bits. Even assuming the algorithm works with a biased RNG, to obtain good results you'll have to expend at least as much computational effort as it would take to de-skew the RNG (and probably more, since you'll have to both run the algorithm and "defeat" the biasing at the same time).

If your question is just theoretical, then please disregard this answer. If it is practical, then please seriously consider de-skewing your RNG instead of making assumptions about the output of the algorithm.

Giuseppe Cardone
Thanks for your answer. I think de-skewing would cost more than just using a slower but better RNG.
JohnPS
I edited my answer to clarify my point (it was too long for a comment). The bottom line is: the entropy of the algorithm's output can't be higher than the entropy provided by the RNG, so even if the algorithm works with biased RNGs, you would have to re-apply it enough times to squeeze out a sufficient number of really random bits; this computational effort can't be lower than the effort required to de-skew the RNG (and is in fact probably much higher).
Giuseppe Cardone
To shuffle a 25-element array we take the raw random numbers and reduce them modulo 25, 24, 23, ..., 2. This would appear to add randomness. Also, reusing the previous shuffle adds to the "state", so can't we think of the RNG plus the "shuffle algorithm using the previous shuffle" as a "random permutation generator" with more state than the underlying RNG, and hence more "randomness"?
JohnPS
No, we can't. A (P)RNG is an algorithm that takes a few truly random bits (the seed) and "spreads" their entropy over a long stream of bits. The state of the RNG ensures that, given an N-bit seed, the RNG will output one of 2^N possible streams and that different seeds result in different output bitstreams. Adding more state bits, in any way (including feeding them to an algorithm), does not generate entropy; in fact you are probably losing some, because your additional state (in your case the cards' positions) is not designed to map a few entropy bits to a long pseudo-random bitstream.
Giuseppe Cardone
It doesn't add randomness in the sense that if you know the starting state, you can predict all successive permutations. But if you don't know the starting seed, are you saying that by looking at the output sequence of permutations, you could tell they are not truly random just as easily as by looking at the sequence of numbers produced by the underlying RNG? The period of the permutations is longer than the period of the RNG, so this test of randomness is harder to fool.
JohnPS
I see your point. My gut feeling is that it would increase the space complexity but not the time complexity, but I'm not sure and I can't prove it, so my advice is: unless you can strongly justify why using multiple shuffles is the best choice, you should de-skew your RNG or use a better one, which is the best practice. As a side note, the output of many RNGs can be predicted even without knowing their starting state, just by observing their past output (for example, see http://www.springerlink.com/content/p4526x2j040m7j12/ and http://portal.acm.org/citation.cfm?id=1290930.1290938 ).
Giuseppe Cardone
There is a special class of RNGs that, given some assumptions, strongly guarantee that their output does not leak their internal state. These RNGs are called CSPRNGs (cryptographically secure PRNGs). If you can't risk someone guessing the state of your RNG, you should use a CSPRNG. Some of them are quite fast, too (for example, ISAAC).
Giuseppe Cardone
Thanks for all the info. It's all very interesting, but I don't need a heavy-duty cryptographic RNG. I really only need something that produces all possible permutations with roughly equal likelihood. Using the previous shuffle on successive iterations seems like it can't hurt and is more efficient, and some simple tests should be good enough.
JohnPS
A: 

It depends entirely on the bias. In general I would say "don't count on it".

Biased algorithm that converges to non-biased:

Do nothing half of the time, and do a correct shuffle the other half. This converges towards non-biased exponentially: after n shuffles there is a 1 - 1/2^n chance the shuffle is non-biased and a 1/2^n chance the output is still the original input sequence.

Biased algorithm that stays biased:

Shuffle all elements except the last one. This stays permanently biased towards never moving the last element.

More General Example:

Think of a shuffle algorithm as a weighted directed graph over permutations, where the weights out of a node correspond to the probability of transitioning from one permutation to another in a single shuffle. A biased shuffle algorithm will have non-uniform weights.

Now suppose you filled one node in that graph with water, and the water flowed from one node to the next based on the weights. The algorithm converges to non-biased if the distribution of water converges to uniform no matter which node it started from.

So in what cases will the water not spread out uniformly? Well, if you have a cycle of above-average weights, the nodes in the cycle will tend to feed each other and stay above the average amount of water. They won't take all of it, since as they get more water the amount coming in decreases and the amount going out increases, but they will stay above average.
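
To make the water analogy concrete, you can compute the flow directly: treat one pass of the shuffle as a probabilistic transition between permutations and iterate. A toy Python sketch for 3 elements, with a made-up biased draw distribution standing in for the biased RNG (this models the draws as independent, which a real PRNG's outputs are not):

    from itertools import permutations, product

    def p_draw(j, k):
        # Hypothetical bias: drawing from range(k), 0 is twice as likely
        # as any other value. Purely illustrative.
        weights = [2.0 if x == 0 else 1.0 for x in range(k)]
        return weights[j] / sum(weights)

    def step(start):
        # Distribution over permutations after one biased Fisher-Yates pass.
        n, dist = len(start), {}
        for draws in product(*(range(i + 1) for i in range(n - 1, 0, -1))):
            a, p = list(start), 1.0
            for i, j in zip(range(n - 1, 0, -1), draws):
                p *= p_draw(j, i + 1)
                a[i], a[j] = a[j], a[i]
            dist[tuple(a)] = dist.get(tuple(a), 0.0) + p
        return dist

    # Pour all the "water" on one node and let it flow.
    dist = {q: (1.0 if q == (0, 1, 2) else 0.0) for q in permutations(range(3))}
    for _ in range(20):
        new = dict.fromkeys(dist, 0.0)
        for perm, mass in dist.items():
            for nxt, p in step(perm).items():
                new[nxt] += mass * p
        dist = new
    # With this full-support bias the distribution converges to uniform;
    # a bias that can never move the last element would instead trap the
    # water on a subset of nodes, as in the example above.
    print(dist)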

Strilanc
"Shuffle all elements except the last one." But I ask what if the shuffle algorithm is not biased, but the underlying RNG is? Perhaps this is what you were addressing in the last part. I could believe that consecutive permutations could be correlated if the RNG is biased, but I think the RNG would need to be pretty bad to notice just by looking at the permutations. I guess testing is needed to know for sure.
JohnPS
Assuming a shuffle function which has no unreachable permutations, the RNG can be used to simulate any algorithm you want.
Strilanc