I'd like to produce fast random shuffles repeatedly with minimal bias.

It's known that the Fisher-Yates shuffle is unbiased as long as the underlying random number generator (RNG) is unbiased.

To shuffle an array a of n elements:
  for i from n − 1 downto 1 do
       j ← random integer with 0 ≤ j ≤ i
       exchange a[j] and a[i]
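
In Python, that pseudocode is roughly the following (random.randrange stands in here for whatever RNG is actually used):

    import random

    def fisher_yates(a, rand_below=random.randrange):
        # rand_below(k) must return an integer in [0, k).
        for i in range(len(a) - 1, 0, -1):
            j = rand_below(i + 1)  # 0 <= j <= i
            a[i], a[j] = a[j], a[i]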

But what if the RNG is biased (but fast)?

Suppose I want to produce many random permutations of an array of 25 elements. If I use the Fisher-Yates algorithm with a biased RNG, my permutations will be biased, but I believe this assumes the 25-element array starts from the same state before each application of the shuffle. One problem, for example, is that if the RNG has a period of only 2^32 ≈ 4×10^9, we cannot produce every possible permutation of the 25 elements, because there are 25! ≈ 1.6×10^25 permutations.

My general question is: if I leave the array in its shuffled state before starting each new application of the Fisher-Yates shuffle, rather than resetting it, would this reduce the bias and/or allow the algorithm to produce every permutation?

My guess is that it would generally produce better results, but it seems that if the number of elements in the repeatedly shuffled array were related to the period of the underlying RNG, the permutations could actually repeat more often than expected.

Does anyone know of any research that addresses this?

As a sub-question, what if I only want repeated permutations of 5 of the 25 elements in the array, so I use the Fisher-Yates algorithm to select 5 elements and stop before doing a full shuffle? (I use the 5 elements at the end of the array that got swapped.) Then I start over, using the previous partially shuffled 25-element array to select another permutation of 5. Again, it seems like this would be better than starting from the original 25-element array if the underlying RNG is biased. Any thoughts on this?
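
Concretely, the partial shuffle I mean looks something like this Python sketch (rand_below is a stand-in for the fast biased RNG, returning an integer in [0, k); the array is deliberately not reset between calls):

    def draw_five(a, rand_below):
        # Run the first 5 steps of Fisher-Yates on a (in place) and
        # return the 5 elements that were swapped to the end.
        n = len(a)
        for i in range(n - 1, n - 6, -1):  # i = 24, 23, 22, 21, 20 when n = 25
            j = rand_below(i + 1)          # 0 <= j <= i
            a[i], a[j] = a[j], a[i]
        return a[n - 5:]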

I think it would be easier to test the partial-shuffle case, since there are only 6,375,600 possible permutations of 5 out of 25 elements. Are there any simple tests to use to check for biases?
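
One simple test along these lines might be a chi-squared goodness-of-fit check over all 6,375,600 cells. A sketch, assuming samples is a list of the 5-element tuples produced by repeated partial shuffles (for the expected count per cell to be reasonable, this needs tens of millions of samples):

    from collections import Counter

    def chi_squared_uniform(samples, cells=6375600):
        # Chi-squared statistic against the uniform distribution over all
        # `cells` possible outcomes; compare the result to a chi-squared
        # distribution with cells - 1 degrees of freedom.
        counts = Counter(samples)
        expected = len(samples) / cells
        stat = sum((c - expected) ** 2 / expected for c in counts.values())
        stat += (cells - len(counts)) * expected  # cells never observed
        return stat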

+2  A: 

A couple of points:

1) Anyone using the Fisher-Yates shuffle should read Jeff Atwood's Coding Horror post on the naïve shuffle and make doubly sure their implementation is correct.
2) Doesn't repeating the shuffle defeat the purpose of using a faster random number generator? Surely if you're going to have to repeat every shuffle 5 times to get the desired entropy, you're better off using a low-bias generator.
3) Do you have a setup where you can test this? If so, start trying things: Jeff's graphs make it clear that you can easily detect quite a lot of errors by using small decks and visually portraying the results.

Daniel
The shuffle listed in the original question is Fisher-Yates and is implemented correctly.
NickLarsen
I found 1) interesting because I thought that most coders already knew how to shuffle correctly. However, I was looking at the sorting test suite for OpenJDK and saw that they were just doing the naive shuffle many times over when shuffling test data. I wondered if they had any reason for it, but I guess not.
Justin Peel
About point 2: I want to use each random shuffle, then simply reuse the shuffled array as the starting state for the next shuffle, not shuffle 2 or more times before using a permutation.
JohnPS
As for Jeff's graphs, using smaller sets does show the differences easily, but for shuffling, say, a deck of cards (N = 52), the number of times you would have to shuffle to show statistically significant differences is pretty astounding. So for those, it's much easier to use proofs.
NickLarsen
I agree that a simple test would be to check that every permutation (say, of 5 out of 25) is produced approximately equally often over a large sample. With a deck of cards, as NickLarsen points out, this is harder to do if we want complete permutations of 52.
JohnPS
@NickLarsen: It depends how biased your RNG is. If it alternates between odd and even outputs, you might be able to detect the bias with overwhelming confidence in a *single* shuffle of a 52-card deck.
Steve Jessop
+1  A: 

I can't completely answer your question, but this observation seemed too long for a comment.

What happens if you ensure that the number of random numbers pulled from your RNG for each iteration of Fisher-Yates has a high least common multiple with the RNG period? That may mean "wasting" a random integer at the end of the algorithm. When shuffling 25 elements, you need 24 random numbers. If you pull one more random number at the end, making 25 random numbers per shuffle, you're no longer guaranteed a repetition at the RNG period. Randomly, the same 25 numbers could still occur in succession before the period is exhausted, of course. But since 25 has no common factor other than 1 with 2^32, you wouldn't hit a guaranteed repetition until 25 * 2^32 draws. That isn't a huge improvement, but you said this RNG is fast. What if the "wasted" amount were much larger? It may still not be practical to reach every permutation, but you could at least increase the number you can reach.
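
In code, the idea might look like the following Python sketch (next_raw is a stand-in for the fast RNG's raw output):

    def shuffle_with_waste(a, next_raw):
        # Standard Fisher-Yates on 25 elements uses len(a) - 1 = 24 draws;
        # pull one extra so each shuffle consumes 25 values. Since
        # gcd(25, 2**32) == 1, the stream phase cycles through all 2**32
        # offsets before a shuffle is guaranteed to repeat.
        for i in range(len(a) - 1, 0, -1):
            j = next_raw() % (i + 1)  # modulo reduction; adds its own small bias
            a[i], a[j] = a[j], a[i]
        next_raw()  # the "wasted" draw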

Andrew
That's an interesting observation. I'm guessing repeated permutations are usually not a problem. I think "shuffling the previously shuffled array" could only lengthen the period between repeats.
JohnPS
+3  A: 

if the RNG has a period of only 2^32 ≈ 4×10^9, we cannot produce every possible permutation of the 25 elements, because there are 25! ≈ 1.6×10^25 permutations

This is only true as long as the seed determines every successive selection. As long as your RNG can be expected to deliver a precisely even distribution over the range specified for each selection, it can produce every permutation. If your RNG cannot do that, having a larger seed base will not help.

As for your side question, you might as well reseed for every draw. However, reseeding is only useful if the new seed contains enough entropy. Timestamps don't contain much entropy, and neither do algorithmic calculations.

I'm not sure what larger solution this is part of, since you haven't described it, but if you are trying to compute something over a larger domain using random input, there are probably better methods.

NickLarsen
No, he's right: if your RNG has a period of ~10^9, it can't possibly generate 25! distinct sequences, and thus not all shuffles are equally probable. I wouldn't expect to see any systematic bias in which shuffles are possible, though, if the PRNG is even remotely good.
Nick Johnson
I never said he was wrong. I pointed out that what he said only holds when the RNG's period is smaller than the total number of permutations and the seed determines every successive number generated. There are a number of alternative RNG methods that do not use seeds or have periods. When you remove the dependency on a seed, my claims are correct.
NickLarsen
Good point. I was only considering RNGs that produce a single sequence with a certain period (say 2^32), where the seed only determines the starting point in that sequence, such as simple linear congruential generators.
JohnPS
+1  A: 

My feeling is that with a biased RNG, repeated runs of the Knuth shuffle would produce all the permutations, but I'm not able to prove it (it depends on the period of the RNG and on how biased it is).

So let's reverse the question: given an algorithm that requires a random input and a biased RNG, is it easier to de-skew the algorithm's output or to de-skew the RNG's output?

Unsurprisingly, the latter is much easier to do (and is of broader interest): there are several standard techniques for it. A simple technique, due to von Neumann, is: given a bitstream from a biased RNG, take the bits in pairs, throw away every (0,0) and (1,1) pair, return 1 for every (1,0) pair and 0 for every (0,1) pair. This technique assumes that the bits come from a stream where each bit has the same probability of being 0 or 1 as any other bit, and that the bits are not correlated. Elias generalized von Neumann's technique to a more efficient scheme (one that discards fewer bits).
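
A minimal sketch of the von Neumann technique in Python (bits is assumed to be an iterable of 0/1 values from the biased source):

    def von_neumann(bits):
        # Take non-overlapping pairs: (1,0) -> 1, (0,1) -> 0,
        # (0,0) and (1,1) are discarded.
        it = iter(bits)
        for first in it:
            second = next(it, None)
            if second is None:
                return
            if first != second:
                yield first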

But even strongly biased or correlated bits may contain useful amounts of randomness, which can be extracted, for example, with a technique based on the Fast Fourier Transform.

Another option is to feed the biased RNG output to a cryptographically strong function, for example a message digest algorithm, and use its output.
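
For example, a sketch of the digest approach using SHA-256 from Python's standard library (the 64-byte input block size is an arbitrary choice here):

    import hashlib

    def whitened_stream(next_raw_byte):
        # Hash 64 biased input bytes at a time and emit the 32 digest
        # bytes; next_raw_byte is a stand-in for the biased RNG.
        while True:
            block = bytes(next_raw_byte() for _ in range(64))
            yield from hashlib.sha256(block).digest()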

For further references on how to de-skew random number generators, I suggest reading the Randomness Recommendations for Security RFC (RFC 1750).

My point is that the quality of the output of a random-based algorithm is upper-bounded by the entropy provided by the RNG: if it is extremely biased, the output will be extremely biased, no matter what you do. The algorithm can't squeeze out more entropy than is contained in the biased random bitstream. Worse: it will probably lose some random bits. Even assuming the algorithm works with a biased RNG, to obtain good results you'll have to expend at least as much computational effort as it would take to de-skew the RNG (and probably more, since you'll have to both run the algorithm and "defeat" the biasing at the same time).

If your question is just theoretical, then please disregard this answer. If it is practical, then please seriously consider de-skewing your RNG instead of making assumptions about the output of the algorithm.

Giuseppe Cardone
Thanks for your answer. I think de-skewing would cost more than just using a slower but better RNG.
JohnPS
I edited my answer to clarify my point (it was too long for a comment). The bottom line is: the entropy of the algorithm's output can't be higher than the entropy provided by the RNG, so even if the algorithm works with biased RNGs, you would have to re-apply it enough times to squeeze out a sufficient number of really random bits; this computational effort can't be lower than the effort required to de-skew the RNG (and is in fact probably much higher).
Giuseppe Cardone
To shuffle a 25-element array we take the raw random numbers and reduce them modulo 25, 24, 23, ..., 2. This would appear to add randomness. Also, reusing the previous shuffle adds to the "state", so can't we think of the RNG plus the "shuffle algorithm using the previous shuffle" as a "random permutation generator" with more state than the underlying RNG, and hence more "randomness"?
JohnPS
No, we can't. A (P)RNG is an algorithm that takes a few truly random bits (the seed) and "spreads" their entropy over a long stream of bits. The state of the RNG ensures that, given an N-bit seed, the RNG will output one of 2^N possible streams and that different seeds result in different output bitstreams. Adding more state bits, in any way (including feeding them to an algorithm), does not generate entropy; in fact you are probably losing some, because your additional state (in your case the cards' positions) is not designed to map a few entropy bits to a long pseudo-random bitstream.
Giuseppe Cardone
It doesn't add randomness in the sense that if you know the starting state, you can predict all successive permutations. But if you don't know the starting seed, are you saying that by looking at the output sequence of permutations, you could tell they are not truly random just as easily as by looking at the sequence of numbers produced by the underlying RNG? The period of the permutations is longer than the period of the RNG, so this test of randomness is harder to fool.
JohnPS
I see your point. My gut feeling is that it would increase the space complexity but not the time complexity, but I'm not sure and I can't prove it, so my advice is: unless you can strongly justify why using multiple shuffles is the best choice, you should de-skew your RNG or use a better one, which is the best practice. As a side note, the output of many RNGs can be predicted even without knowing their starting state, just by observing their past output (for example, see http://www.springerlink.com/content/p4526x2j040m7j12/ and http://portal.acm.org/citation.cfm?id=1290930.1290938 ).
Giuseppe Cardone
There is a special class of RNGs that, given some assumptions, strongly guarantee that their output does not leak their internal state. These RNGs are called CSPRNGs (cryptographically secure PRNGs). If you can't risk someone guessing the state of your RNG, you should use a CSPRNG. Some of them are quite fast, too (for example, ISAAC).
Giuseppe Cardone
Thanks for all the info. It's all very interesting, but I don't need a heavy-duty cryptographic RNG. I really only need something that produces all possible permutations with roughly equal likelihood. Using the previous shuffle on successive iterations seems like it can't hurt and is more efficient, and some simple tests should be good enough.
JohnPS
A: 

It depends entirely on the bias. In general I would say "don't count on it".

Biased algorithm that converges to non-biased:

Do nothing half of the time, and do a correct shuffle the other half. This converges towards non-biased exponentially: after n shuffles there is a 1 - 1/2^n chance the shuffle is non-biased and a 1/2^n chance the output is still the original input sequence.

Biased algorithm that stays biased:

Shuffle all elements except the last one. This stays permanently biased towards never moving the last element.

More General Example:

Think of a shuffle algorithm as a weighted directed graph over permutations, where the weights out of a node correspond to the probability of transitioning from one permutation to another in a single shuffle. A biased shuffle algorithm will have non-uniform weights.

Now suppose you filled one node in that graph with water, and the water flowed from one node to the next based on the weights. The algorithm converges to non-biased if the distribution of water converges to uniform no matter which node it started from.

So in what cases will the water not spread out uniformly? Well, if you have a cycle of above-average weights, the nodes in the cycle will tend to feed each other and stay above the average amount of water. They won't take all of it, since as they get more water the amount coming in decreases and the amount going out increases, but they will stay above average.
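
To make the water analogy concrete, you can compute the flow directly: treat one pass of the shuffle as a probabilistic transition between permutations and iterate. A toy Python sketch for 3 elements, with a made-up biased draw distribution standing in for the biased RNG (this models the draws as independent, which a real PRNG's outputs are not):

    from itertools import permutations, product

    def p_draw(j, k):
        # Hypothetical bias: drawing from range(k), 0 is twice as likely
        # as any other value. Purely illustrative.
        weights = [2.0 if x == 0 else 1.0 for x in range(k)]
        return weights[j] / sum(weights)

    def step(start):
        # Distribution over permutations after one biased Fisher-Yates pass.
        n, dist = len(start), {}
        for draws in product(*(range(i + 1) for i in range(n - 1, 0, -1))):
            a, p = list(start), 1.0
            for i, j in zip(range(n - 1, 0, -1), draws):
                p *= p_draw(j, i + 1)
                a[i], a[j] = a[j], a[i]
            dist[tuple(a)] = dist.get(tuple(a), 0.0) + p
        return dist

    # Pour all the "water" on one node and let it flow.
    dist = {q: (1.0 if q == (0, 1, 2) else 0.0) for q in permutations(range(3))}
    for _ in range(20):
        new = dict.fromkeys(dist, 0.0)
        for perm, mass in dist.items():
            for nxt, p in step(perm).items():
                new[nxt] += mass * p
        dist = new
    # With this full-support bias the distribution converges to uniform;
    # a bias that can never move the last element would instead trap the
    # water on a subset of nodes, as in the example above.
    print(dist)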

Strilanc
"Shuffle all elements except the last one." But I ask what if the shuffle algorithm is not biased, but the underlying RNG is? Perhaps this is what you were addressing in the last part. I could believe that consecutive permutations could be correlated if the RNG is biased, but I think the RNG would need to be pretty bad to notice just by looking at the permutations. I guess testing is needed to know for sure.
JohnPS
Assuming a shuffle function which has no unreachable permutations, the RNG can be used to simulate any algorithm you want.
Strilanc