views:

447

answers:

8

I have been doing a little recreational holiday computing. My mini-project was a simulation of the Italian game of "tomboli". A key building block was a simulation of the following process;

The game is controlled by a man with a bag of 90 marbles, numbered 1 to 90. He draws marbles one by one randomly from the bag, each time calling out the marble number to the players.

After a little thought I wrote the following code for this building block;

// NBR marbles, numbered 1...NBR are in a bag. Simulate randomly
//  pulling them from the bag, one by one, until the bag is empty
void bag( int random_sequence[NBR] )
{
    int i;

    // Store each marble as it is pulled out
    int *store = random_sequence;

    // Array of marbles still in the bag
    int not_yet_pulled[NBR];
    for( i=0; i<NBR; i++ )
        not_yet_pulled[i] = i+1;    // eg NBR=90; 1,2,3 ... 90

    // Loop pulling marbles from the bag, one each time through
    for( i=NBR; i>=1; i-- )
    {
        int x = rand();
        int idx = x%i;  // eg i=90 idx is random in range 0..89
                        // eg i=89 idx is random in range 0..88
                        //            ...
                        // eg i=1  idx is random in range 0..0
                        //  (so we could optimize when i=1 but not
                        //   worth the bother)
        *store++  = not_yet_pulled[idx];

        // Replace the marble just drawn (so it cannot be pulled again)
        //  with the last marble in the bag. So;
        //  1) there is now one less marble in the bag
        //  2) only marbles not yet pulled are still in the bag
        // If we happened to pull the last marble in the bag, this is
        // not required but does no harm.
        not_yet_pulled[idx] = not_yet_pulled[i-1];
    }
}

I know there are subtleties and traps all over the place in game simulation with random numbers, so although I am pretty happy with my code my confidence is a little less than 100%. So my questions are;

1) Is there anything wrong with my code ?

2) [if the answer to 1) is no] Am I unknowingly using a standard shuffling algorithm ?

3) [if the answer to 2) is no] How does my algorithm compare to standard alternatives ?

EDIT Thanks to all who answered. I am going to accept Aidan Cully's answer because it turns out I was rediscovering the Fisher-Yates algorithm, and revealing that gets to the heart of the matter. Of course it is no surprise I could have saved myself time and effort by doing some research up front. But on the other hand it was a fun hobby project. The rest of the simulation was routine, this was the most interesting part, and I would have deprived myself of enjoyment by not having a go myself. Additionally, I was trying to simulate a man taking marbles from a bag, and it was fairly late in the piece that I realized that the situation was exactly analagous to shuffling cards.

Another point of interest is that there is a small flaw, identified by Ken who points out that the oft repeated pattern rand()%N is not a really good way of picking a random number from the range 0..N-1.

Finally, my version of Fisher-Yates lacks the elegant trick that allows the nice property of shuffling in place. As a result, my algorithm would end up with an equally random but reversed shuffle.

+7  A: 

You're using the Fisher-Yates shuffling algorithm.

Aidan Cully
This is basically correct, the only difference is that you're storing into a new array, whereas the F-Y algorithm (as described, e.g. by rossfabricant above) stores into the empty part at the end of the input array.
Edmund
@Edmund: good explanation.
Mark Byers
Well, it's not **exactly** the fFisher-Yates shuffling algorithm, but it's so close you can sensibly say it is, so i didn't understood the -1 and gave back some points.
kriss
@Edmund: Actually the FY is a mathamatical algorithm and as such does not have the concept of a storage mechanism. The standard implementation as described by Durstenfeld though does do it in place.
Martin York
+10  A: 

Use the Fisher-Yates-Knuth shuffle:

public static void shuffle(int[] array) 
{
    Random rng = new Random();       // java.util.Random.
    // n is the number of items left to shuffle
    for (int n = array.length; n > 1; n--) 
    {
        // Pick a random element to move to the end
        int k = rng.nextInt(n);  // 0 <= k <= n - 1.
        // Simple swap of variables
        int tmp = array[k];
        array[k] = array[n - 1];
        array[n - 1] = tmp;
    }
}

It looks like your code might work, but I'm not sure. It's more obfuscated than the standard algorithm.

RossFabricant
<stolenQuote>Just wanted to pipe in that python's random.shuffle() does use the Fisher-Yates algorithm.</stolenQuote>
Hamish Grubijan
Fisher-Yates is proven to be unbiased and is about as simple and efficient as it gets. There are very few reasons to use anything else.
dsimcha
+1  A: 

Analysing algorithms to check if they are truly random is exceedingly hard.
Apart from people with college level maths (or as Americans put it, a major in Math), this is way beyond most people's skill to even validate.

Thus you should try and use algorithms that are already built.
Have you looked at std::random_shuffle()?

 void bag( int random_sequence[NBR] )
 {
     for(int i=0; i<NBR; ++i) 
     {    random_sequence[i] = i+1;
     }
     std::random_shuffle(random_sequence,random_sequence + NBR);
 }

Quote from std::random_shuffle() page:

This algorithm is described in section 3.4.2 of Knuth (D. E. Knuth, The Art of Computer Programming. Volume 2: Seminumerical Algorithms, second edition. Addison-Wesley, 1981). Knuth credits Moses and Oakford (1963) and Durstenfeld (1964). Note that there are N! ways of arranging a sequence of N elements. Random_shuffle yields uniformly distributed results; that is, the probability of any particular ordering is 1/N!. The reason this comment is important is that there are a number of algorithms that seem at first sight to implement random shuffling of a sequence, but that do not in fact produce a uniform distribution over the N! possible orderings. That is, it's easy to get random shuffle wrong.

Martin York
+1, but three little things: 1) College (no d). 2) I think this is a learning exercise rather than production code. 3) I can tall the OP's algorithm isn't truly random because `rand() % i` will favor lower results if `(RAND_MAX + 1) % i != 0` (which is probably true for most values of `i`).
Chris Lutz
@Chris: could you elaborate on why rand() % i should favor lower results even when RAND_MAX is large enough beside i ? I believe many generators use LCG and such bias won't be detectable if you do not go for the full sequence.
kriss
@Kriss: Yes the difference is small. But the point is that it is measurable and thus basis is introduced. That is why good random text books explain in detail why you should use " floor(rand()/(RAND_MAX + 1,o) * RANGE) " (Though still not perfect is better than using modulas). But getting better than this requires maths skills why bond my ability. Thus I prefer to use established algorithms that have been written by people with the appropriate knowledge and education.
Martin York
@Kriss: I think a simpler answer is the following: if RAND_MAX were 4 and i were 3, then 0 and 1 would be output 40% of the time, while 2 would be output 20% of the time. As the numbers get larger, the gap narrows but never gets to zero.
Mark Ruzon
A: 

As others have already commented, use a proven shuffling algorithm.

It is worth noting that your C/C++ library is only providing a pseudo-random number.

Systems that require high reliability of the randomization algorithm use dedicated hardware to generate the random number. High-end poker sites are a good example. See for example the Pokerstars writeup on their random number generation technique.

Early versions of Netscape encryption were broken because the hackers were able to predict the "random" number used because the pseudo-random number generator was seeded with the current time. See this writeup on Wikipedia.

Eric J.
+1  A: 

Just a couple stylistic points:

  1. Your signature of taking an array with a given length might give the illusion that the parameter is guaranteed by the compiler to contain at least IDX elements. It's not.
  2. I would probably give the loop index in the second for loop a more desciptive name like marblesRemaining, so it's clearer what it is and does not need to be explained by comments. It would also separate it from the entirely different use it has in the first loop.
JohnMcG
1.Yes I realize that, I added the given length only at the end as a kind of executable comment. 2. Good point. I avoid and detest loop couters with long descriptive pointless names, but you're right, this is not just a loop counter and so a descriptive name would be helpful. So +1
Bill Forster
+7  A: 
int idx = x%i;  // eg i=90 idx is random in range 0..89

It's in that range, but it's not evenly distributed, unless 90 (or NBR) divides max(rand()). If you're on a 2-bit computer, that's probably not true. Here, idx is slightly more likely to be 0 than to be 89, for example.

Ken
Good point! It's only a very slight bias if n will be small, but it will be noticable if you run enough tests.
Mark Byers
+1 Bingo​​​​​​​​​​​​​​​​​​​​​​​​​.
Chris Lutz
@Ken: want to bet rand(n) has the same kind of bias on many implementations ?
kriss
kriss: It's easy to get rid of that bias. See `java.util.Random.nextInt(int)` for example. I doubt implementations will yield a biased number in that context.
Joey
+2  A: 

An alternative to rand() % i which will have a better near-uniform distribution (at the expense of performance) is (int) ((rand() / (double) (RAND_MAX+1)) * i).

Or, use a pseudo-random number generation algorithm known to work well such as the Mersenne twister.

Carlo
No need to cast to double. By adding 1.0 rather than 1 you get a double.
Martin York
Although that just moves the semantics of the cast to the compiler. At the assembly level, a cast still happens whether you explicitly code it or implicitly require it by adding a double.
Eric J.
+1  A: 

Quibbles over random number generation aside, your shuffle algorithm looks correct.

You can improve it, though: with a little thought, you can see that you can shuffle the numbers in place. So, instead of allocating a temporary array, you can just use the output buffer.

comingstorm
Indeed +1. I would end up with precisely the Fisher Yates algorithm that others have supplied
Bill Forster