ansaurus

Question

Verify Knuth shuffle algorithm is as unbiased as possible

Answer 1

+2 A:

Have a look at this article from Jeff Atwood:

Shuffling
http://www.codinghorror.com/blog/archives/001008.html

See also:

The Danger of Naïveté
http://www.codinghorror.com/blog/archives/001015.html

Robert Harvey 2009-11-06 04:22:22

I'm aware of the general limitations of pseudorandom number generators; do you think I should look into moving to Eric Lippert's algorithm (assign each index a random floating point value and sort) or beefing up my random number generator, perhaps by tapping a cryptography API or implementing a better PRNG like the Mersenne Twister or Blum Blum Shub?

Adam Maras 2009-11-06 05:11:35

Actually, I missed the best post. See: The Danger of Naïveté -- http://www.codinghorror.com/blog/archives/001015.html

Robert Harvey 2009-11-06 06:45:09

Answer 2

+5 A:

Well, one thing you could do as a black-box test is take some relatively small array size, perform a large number of shuffles on it, count how many times you observe each permutation, and then perform Pearson's Chi-square test to determine whether the results are uniformly distributed over the permutation space.

On the other hand, the Knuth shuffle, AKA the Fisher-Yates shuffle, is proven to be unbiased as long as the random number generator that the indices are coming from is unbiased.

dsimcha 2009-11-06 04:23:36

I'll look into permutation counting; Pearson's Chi-square test (as far as I've read up on it) looks a little out of my league as far as mathematics are concerned, but I'll continue reading up on it.

Adam Maras 2009-11-06 05:14:52

You don't really need to go as far as Chi-square to test whether the shuffle is biased. Something as simple as standard deviation will work with the same type of test (you should see the standard deviation approach zero as the number of shuffles increases).

Greg Beech 2009-11-06 09:42:29

@Adam: I just thought it would be easy to use Chi-square because I wrote a library function that did Pearson's Chi-square a while back. I've since forgotten the details of how it works, but I still remember how to use it. The only problem is it's written in D, not C++. There should be C++ libs that do this, but if you don't have one conveniently available (installed, etc) already, then it may be overkill.

dsimcha 2009-11-06 14:28:08

@dsimcha: would you be willing to share your D library for Chi-square? I'm sure I could port it to C++ (for personal use only, of course) without much trouble. (You can reach me at firstname dot lastname at gmail.)

Adam Maras 2009-11-06 22:23:45

http://dsource.org/projects/dstats

dsimcha 2009-11-06 22:57:59

But it's actually a whole statistics lib. The actual chi-square function is pretty short, but you'd have to port or find C++ equivalents of a whole bunch of lower-level functionality, like incomplete gamma functions. It's probably overkill, but I figured I'd share it with you just in case it isn't. Alternatively, you can just have your C++ code print out the values and paste them into a statistics package like R, or even just eyeball them instead of doing a formal test, if this testing doesn't have to be fully automated.

dsimcha 2009-11-06 23:00:38

Answer 3

+1 A:

The Knuth shuffle itself is provably unbiased: There exists exactly one series of operations that yields each possible shuffle. It's unlikely your PRNG has enough bits of state to express every possible shuffle, however, so the real question is if your PRNG is 'random enough' with regards to the set of shuffles it will actually produce, and whether your seeding strategy is secure enough.

Only you can decide this, as it depends on the consequences of a shuffle that isn't random enough. If you're dealing with real money, for example, I would suggest switching to a cryptographically secure PRNG and improving your seeding strategy. Although most built in PRNGs generate good randomness, they're also quite easy to reverse engineer, and calling seed() with no arguments is likely seeding based on the current time, which is pretty easy to predict.

Nick Johnson 2009-11-06 09:37:19

Answer 4

+5 A:

If I see that right, your random::get (max) doesn't include max.

This line:

draw_t push_index = random::get( pull_index );

then produces a "classical" off-by-one error, as your pull_index and push_index erroneously can never be the same. This produces a subtle bias that you can never have an item where it was before the shuffle. In an extreme example, two-item lists under this "shuffle" would always be reversed.

Svante 2009-12-08 13:26:03

+1. Good eye. This deserves more upvotes.

Jason Orendorff 2009-12-09 16:39:38

ansaurus

tags:

views:

answers:

Verify Knuth shuffle algorithm is as unbiased as possible

related questions