it seems that this simple shuffle algorithm will produce biased results:

# suppose $arr is filled with 1 to 52

for ($i = 0; $i < 52; $i++) {
  $j = rand(0, 51);

  # swap the items

  $tmp = $arr[$j];
  $arr[$j] = $arr[$i];
  $arr[$i] = $tmp;
}

you can try it... instead of using 52, use 3 (suppose only 3 cards are used), and run it 10,000 times and tally up the results, you will see that the results are skewed towards certain patterns...
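For anyone who wants to run the tally without writing it from scratch, here is a quick sketch in Python (not the language of the snippet above; the helper name naive_shuffle is mine):

```python
import random
from collections import Counter

def naive_shuffle(arr):
    """The biased algorithm: swap every position with rand(0, n-1)."""
    for i in range(len(arr)):
        j = random.randint(0, len(arr) - 1)
        arr[i], arr[j] = arr[j], arr[i]
    return arr

random.seed(1)  # fixed seed so the tally is reproducible
tally = Counter(tuple(naive_shuffle([1, 2, 3])) for _ in range(10_000))
for perm in sorted(tally):
    print(perm, tally[perm])
```

With a fair shuffle each of the 6 orderings would appear about 1667 times; here the orderings 132, 213 and 231 should come out noticeably ahead (roughly 1852 vs 1481 expected).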

the question is... what is a simple explanation for why that happens?

the correct solution is to use something like

for ($i = 0; $i < 51; $i++) {  # the last card need not be swapped
  $j = rand($i, 51);           # don't touch the cards that have already "settled"

  # swap the items

  $tmp = $arr[$j];
  $arr[$j] = $arr[$i];
  $arr[$i] = $tmp;
}

but the question is... why does the first method, which also seems totally random, produce biased results?

Update 1: thanks for folks here pointing out that it needs to be rand($i, 51) for it to shuffle correctly.

+25  A: 

See this:
The Danger of Naïveté (Coding Horror)

Let's look at your three card deck as an example. Using a 3 card deck, there are only 6 possible orders for the deck after a shuffle: 123, 132, 213, 231, 312, 321.

With your 1st algorithm there are 27 possible paths (outcomes) for the code, depending on the results of the rand() function at different points. Each of these outcomes is equally likely (unbiased), and each maps to exactly one of the 6 possible "real" shuffle results above. We now have 27 items and 6 buckets to put them in. Since 27 is not evenly divisible by 6, some of those 6 combinations must be over-represented.

With the 2nd algorithm there are 6 possible outcomes that map exactly to the 6 possible "real" shuffle results, and they should all be represented equally over time.

This is important because the buckets that are over-represented in the first algorithm are not random. The buckets selected for the bias are repeatable and predictable. So if you're building an online poker game and use the 1st algorithm, a hacker could figure out you used the naive shuffle and, from that, work out that certain deck arrangements are much more likely to occur than others. Then they can place bets accordingly. They'll lose some, but they'll win much more than they lose and quickly put you out of business.
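The 27-paths-into-6-buckets argument can be checked mechanically. A Python sketch (helper names are my own) that enumerates every equally likely sequence of rand() results for 3 cards:

```python
from collections import Counter
from itertools import product

def naive_path(js):
    """Apply the naive shuffle to [1, 2, 3] with fixed rand() results js."""
    arr = [1, 2, 3]
    for i, j in enumerate(js):
        arr[i], arr[j] = arr[j], arr[i]
    return tuple(arr)

# All 3^3 = 27 equally likely paths, bucketed by final ordering.
buckets = Counter(naive_path(js) for js in product(range(3), repeat=3))
print(sorted(buckets.items()))
# 123, 312 and 321 get 4 paths each; 132, 213 and 231 get 5 each.
```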

Joel Coehoorn
Link doesn't work for me, changing to .html from .htm worked.
Davy8
Must've broke it when I edited to show the article title rather than the url in the text. Fixed now :(
Joel Coehoorn
while i have tremendous respect for math, i think the "since it is not divisible" explanation is a bit of an after-the-fact explanation. what if it happens to be divisible for some number n, does that mean it won't be biased? is there an explanation otherwise, such as, for the 3 card case, why a certain card ends up at a particular location more often?
動靜能量
each of the 27 outcomes occurs without bias. each of those outcomes also maps to exactly one of the 6 'real' outcomes. since 6 doesn't go evenly into 27, some of the real outcomes _must_ occur more often than the others.
Joel Coehoorn
To answer Jian's comment, n^n will never be exactly divisible by n! for n>2. Proof: any prime factor of n-1 can not divide n, for then it would have to also divide the difference of n and n-1, or 1. For n>2, there must be such a prime factor, which divides n! but does not divide n^n, since it does not divide n. Hence n! can not divide n^n, which means that the algorithm can never be unbiased for any n>2.
dewtell
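dewtell's divisibility claim is easy to spot-check numerically (a Python sketch using exact integer arithmetic):

```python
from math import factorial

# n^n is never an exact multiple of n! for n > 2, so the naive
# shuffle cannot be unbiased for any deck size above 2.
for n in range(3, 60):
    assert n**n % factorial(n) != 0
print("verified for 3 <= n < 60")
```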
how about if we look at a simple case: we have 27000002 drops of water and distribute them among 5 buckets. we put the first drop in the first bucket, the second drop in the second bucket, ..., and repeat. at the end, we could also "use math" to say the numbers are not divisible and therefore the drops are not evenly distributed. well, the thing is, they are not evenly distributed, but they are very close. so for a math explanation such as the one used for the shuffle algorithm, how come the results cannot be "close enough"?
動靜能量
Because it's predictable which buckets will fill up first. An attacker can use that information to break something.
Joel Coehoorn
hm... assuming the buckets don't get filled up, let's say we generate a random number from 1 to 5 and put the next drop of water into whichever bucket that number names. even if the total number of drops is not divisible by 5, the distribution will be pretty close to even. so how do we know that the "non-divisible distribution" in the shuffle problem won't also be pretty close to an even distribution?
動靜能量
Your premise is flawed. If you generate a truly random number from 1 to 5, the drops will be evenly distributed among your five buckets. This is more like generating a random number from 1 to 6, and for 5 buckets always putting the '6' in bucket 1 instead. Over time, bucket 1 _will_ get a lot more attention, and crackers do know how to take advantage of that.
Joel Coehoorn
@Joel +1 for the 6 numbers 5 buckets analogy
Davy8
This answer is right, and explains why you cannot get *the* uniform distribution, but it's not the full story: the bad algorithm is not just "not uniform", it's actually *far* from uniform. E.g. with n=4, 4^4=256 possibilities _could_ map into the 4!=24 permutations each 10 or 11 times and be somewhat close to uniform, but in fact the counts of the permutations go all the way from 8 to 15. For n=6, you have all the way from 32 to 159 — some permutations are almost FIVE times as likely as the others, which is more variation than is implied by the divisibility argument alone.
ShreevatsaR
Jian - You can get "pretty close" to the right answer if you repeat the swap move enough times. In my answer below, I showed that with three cards, the difference between the highest and lowest probabilities is reduced by at least 2/3 with every swap move (can be more, depending on which position you pick to swap). So after 34 swaps, you would get within 1e-6 of the same probabilities for each position for card 2, even if you were swapping the card in position A every time. That's a lot more than the 2 swaps required by Fisher-Yates to give you the exact answer, so it's not very practical.
dewtell
To illustrate my previous comment: after 0 swaps, the probabilities for card 2 are [0, 1, 0] (max difference = 1). After swapping position A, they are [1/3, 2/3, 0] (diff = 2/3). After swapping position B, we get [4/9, 1/3, 2/9] (diff = 2/9). After swapping position C, we get [10/27, 8/27, 1/3] (diff = 2/27). So we are converging to the "right" answer, it's just that three swaps aren't enough to get very close.
dewtell
+14  A: 

The best explanation I've seen for this effect was from Jeff Atwood on his CodingHorror blog (The Danger of Naïveté).

Using this code to simulate a 3-card random shuffle...

for (int i = 0; i < cards.Length; i++)
{
    int n = rand.Next(cards.Length);
    Swap(ref cards[i], ref cards[n]);
}

...you get this distribution.

Distribution of 3-card shuffle

The shuffle code (above) results in 3^3 (27) possible deck combinations. But the mathematics tell us that there are really only 3! or 6 possible combinations of a 3 card deck. So some of the combinations are over-represented.

You would need to use a Fisher-Yates shuffle to properly (randomly) shuffle a deck of cards.

Enjoy,

Robert C. Cartaino

Robert Cartaino
Are you sure that's not "Cardano" ;)
erickson
is there a non-math answer? please see the comment under Joel Coehoorn's answer.
動靜能量
+1  A: 

See the Coding Horror post The Danger of Naïveté.

Basically (supposing 3 cards):

The naive shuffle results in 3^3 (27) possible deck combinations. That's odd, because the mathematics tell us that there are really only 3! or 6 possible combinations of a 3 card deck. In the KFY shuffle, we start with an initial order, swap from the third position with any of the three cards, then swap again from the second position with the remaining two cards.

eKek0
+13  A: 

Here's the complete probability tree for these replacements.

Let's assume that you start with the sequence 123, and then we'll enumerate all the various ways to produce random results with the code in question.

123
 +- 123          - swap 1 and 1 (these are positions,
 |   +- 213      - swap 2 and 1  not numbers)
 |   |   +- 312  - swap 3 and 1
 |   |   +- 231  - swap 3 and 2
 |   |   +- 213  - swap 3 and 3
 |   +- 123      - swap 2 and 2
 |   |   +- 321  - swap 3 and 1
 |   |   +- 132  - swap 3 and 2
 |   |   +- 123  - swap 3 and 3
 |   +- 132      - swap 2 and 3
 |       +- 231  - swap 3 and 1
 |       +- 123  - swap 3 and 2
 |       +- 132  - swap 3 and 3
 +- 213          - swap 1 and 2
 |   +- 123      - swap 2 and 1
 |   |   +- 321  - swap 3 and 1
 |   |   +- 132  - swap 3 and 2
 |   |   +- 123  - swap 3 and 3
 |   +- 213      - swap 2 and 2
 |   |   +- 312  - swap 3 and 1
 |   |   +- 231  - swap 3 and 2
 |   |   +- 213  - swap 3 and 3
 |   +- 231      - swap 2 and 3
 |       +- 132  - swap 3 and 1
 |       +- 213  - swap 3 and 2
 |       +- 231  - swap 3 and 3
 +- 321          - swap 1 and 3
     +- 231      - swap 2 and 1
     |   +- 132  - swap 3 and 1
     |   +- 213  - swap 3 and 2
     |   +- 231  - swap 3 and 3
     +- 321      - swap 2 and 2
     |   +- 123  - swap 3 and 1
     |   +- 312  - swap 3 and 2
     |   +- 321  - swap 3 and 3
     +- 312      - swap 2 and 3
         +- 213  - swap 3 and 1
         +- 321  - swap 3 and 2
         +- 312  - swap 3 and 3

Now, the fourth column of numbers, the one before the swap information, contains the final outcome, with 27 possible outcomes.

Let's count how many times each pattern occurs:

123 - 4 times
132 - 5 times
213 - 5 times
231 - 5 times
312 - 4 times
321 - 4 times
=============
     27 times total

If you run the code that swaps at random for an infinite number of times, the patterns 132, 213 and 231 will occur more often than the patterns 123, 312, and 321, simply because the way the code swaps makes that more likely to occur.

Now, of course, you can say that if you run the code 30 times (27 + 3), you could end up with all the patterns occurring 5 times, but when dealing with statistics you have to look at the long term trend.

Here's C# code that explores the randomness for one of each possible pattern:

using System;
using System.Collections.Generic;

class Program
{
    static void Main(string[] args)
    {
        Dictionary<String, Int32> occurrences = new Dictionary<String, Int32>
        {
            { "123", 0 },
            { "132", 0 },
            { "213", 0 },
            { "231", 0 },
            { "312", 0 },
            { "321", 0 }
        };

        Char[] digits = new[] { '1', '2', '3' };

        // Return a copy of 'input' with the items at pos1 and pos2 swapped.
        Func<Char[], Int32, Int32, Char[]> swap = delegate(Char[] input, Int32 pos1, Int32 pos2)
        {
            Char[] result = new Char[] { input[0], input[1], input[2] };
            Char temp = result[pos1];
            result[pos1] = result[pos2];
            result[pos2] = temp;
            return result;
        };

        // Walk all 3 * 3 * 3 = 27 equally likely paths of the naive shuffle.
        for (Int32 index1 = 0; index1 < 3; index1++)
        {
            Char[] level1 = swap(digits, 0, index1);
            for (Int32 index2 = 0; index2 < 3; index2++)
            {
                Char[] level2 = swap(level1, 1, index2);
                for (Int32 index3 = 0; index3 < 3; index3++)
                {
                    Char[] level3 = swap(level2, 2, index3);
                    String output = new String(level3);
                    occurrences[output]++;
                }
            }
        }

        foreach (var kvp in occurrences)
        {
            Console.Out.WriteLine(kvp.Key + ": " + kvp.Value);
        }
    }
}

This outputs:

123: 4
132: 5
213: 5
231: 5
312: 4
321: 4

So while this answer does in fact involve counting, it's not a purely mathematical answer: you just have to evaluate all the possible paths the random function can take and look at the final outputs.

Lasse V. Karlsen
+2  A: 

Here's another intuition: the single shuffle swap can't create symmetry in the probability of occupying a position unless at least 2-way symmetry already exists. Call the three positions A, B, and C. Now let a be the probability of card 2 being in position A, b be the probability of card 2 being in position B, and c be the probability of it being in position C, prior to a swap move. Assume that no two probabilities are the same: a!=b, b!=c, c!=a. Now compute the probabilities a', b', and c' of the card being in these three positions following a swap. Let's say that this swap move consists of position C being swapped with one of the three positions at random. Then:

a' = a*2/3 + c*1/3
b' = b*2/3 + c*1/3
c' = 1/3.

That is, the probability that the card winds up in position A is the probability it was already there, times the 2/3 probability that position A isn't involved in the swap, plus the probability that it was in position C, times the 1/3 probability that C swapped with A, etc. Now subtracting the first two equations, we get:

a' - b' = (a - b)*2/3

which means that because we assumed a!=b, then a'!=b' (though the difference will approach 0 over time, given enough swaps). But since a'+b'+c'=1, if a'!=b', then neither can be equal to c' either, which is 1/3. So if the three probabilities start off all different before a swap, they will also all be different after a swap. And this would hold no matter which position was swapped - we just interchange the roles of the variables in the above.

Now the very first swap started by swapping card 1 in position A with one of the others. In this case, there was two-way symmetry before the swap, because the probability of card 1 being in position B = the probability of card 1 being in position C = 0. So card 1 can in fact wind up with symmetric probabilities, and it does end up in each of the three positions with equal probability. This remains true for all subsequent swaps. But card 2 winds up in the three positions after the first swap with probabilities (1/3, 2/3, 0), and likewise card 3 winds up in the three positions with probabilities (1/3, 0, 2/3). So no matter how many subsequent swaps we do, we will never wind up with card 2 or 3 having exactly the same probability of occupying all three positions.
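The update above is easy to iterate numerically. A Python sketch (the function name is mine), cycling the swapped position through A, B, C as in the earlier comments: the max-min gap for card 2 shrinks geometrically but never reaches zero.

```python
def swap_update(p, k):
    """Occupancy probabilities after swapping position k with a
    uniformly random position (the a', b', c' equations above)."""
    q = [p[i] * 2 / 3 + p[k] / 3 for i in range(3)]
    q[k] = 1 / 3  # the swapped position becomes uniform
    return q

p = [0.0, 1.0, 0.0]  # card 2 starts in position B
for step in range(12):
    p = swap_update(p, step % 3)  # swap A, then B, then C, repeat
print(p, max(p) - min(p))  # gap is 2/3^12: tiny, but still nonzero
```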

dewtell
+6  A: 

From your comments on the other answers, it seems that you are looking not just for an explanation of why the distribution is not the uniform distribution (for which the divisibility answer is a simple one) but also an "intuitive" explanation of why it is actually far from uniform.

Here's one way of looking at it. Suppose you start with the initial array [1, 2, ..., n] (where n might be 3, or 52, or whatever) and apply one of the two algorithms. If all permutations are uniformly likely, then the probability that 1 remains in the first position should be 1/n. And indeed, in the second (correct) algorithm, it is 1/n, as 1 stays in its place if and only if it is not swapped the first time, i.e. iff the initial call to rand(0,n-1) returns 0.
However, in the first (wrong) algorithm, 1 remains untouched only if it is neither swapped the first time nor at any later step, i.e. only if the first rand returns 0 and none of the other rands returns 0. The probability of that is (1/n) * (1-1/n)^(n-1) ≈ 1/(ne) ≈ 0.37/n, not 1/n.

And that's the "intuitive" explanation: in your first algorithm, earlier items are much more likely to be swapped out of place than later items, so the permutations you get are skewed towards patterns in which the early items are not in their original places.

(It's a bit more subtle than that, e.g. 1 can get swapped into a later position and still end up getting swapped back into place through a complicated series of swaps, but those probabilities are relatively less significant.)
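To put numbers on that: a quick Python check of the "stays untouched" probability against the 1/n a uniform shuffle requires (the ≈ 0.37 comes from 1/e):

```python
from math import e

for n in (3, 52, 1000):
    p_untouched = (1 / n) * (1 - 1 / n) ** (n - 1)
    # ratio to the fair value 1/n; tends to 1/e ≈ 0.368 as n grows
    print(n, p_untouched * n)
```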

ShreevatsaR
A: 

Here's a great analysis of card-shuffling Markov chains. Oh wait, that's all math. Sorry. :)

JP Alioto
+1  A: 

The simple answer is that there are 52^52 possible ways for this algorithm to run, but there are only 52! possible arrangements of 52 cards. For the algorithm to be fair, it needs to produce each of these arrangements with equal probability. 52^52 is not an integer multiple of 52!. Therefore, some arrangements must be more likely than others.
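Both numbers are enormous, but Python's exact integers make the non-divisibility a one-line check:

```python
from math import factorial

# 52^52 paths, 52! arrangements: the remainder is nonzero, so the
# paths cannot be spread evenly over the arrangements.
print(52**52 % factorial(52) != 0)  # True
```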

A: 

Random number generation algorithms are not really random; they're called 'pseudo-random' for that reason. If you really need true randomness, check out hardware methods of generating random numbers.

Jay
A: 

an illustrative approach might be this:

1) consider only 3 cards.

2) for the algorithm to give evenly distributed results, the chance of "1" ending up at a[0] must be 1/3, the chance of "2" ending up at a[1] must be 1/3 too, and so forth.

3) so if we look at the second algorithm:

probability that "1" ends up at a[0]: this happens when the random number generated is 0, which is 1 case out of (0, 1, 2), so the probability is 1 out of 3 = 1/3

probability that "2" ends up at a[1]: when it didn't get swapped to a[0] the first time, and it didn't get swapped to a[2] the second time: 2/3 * 1/2 = 1/3

probability that "3" ends up at a[2]: when it didn't get swapped to a[0] the first time, and it didn't get swapped to a[1] the second time: 2/3 * 1/2 = 1/3

they are all perfectly 1/3, and we don't see any error here.
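The same "perfectly 1/3" conclusion can be reached by brute force: enumerate the second algorithm's 3 * 2 = 6 equally likely rand paths for 3 cards (a Python sketch with names of my own choosing):

```python
from collections import Counter
from itertools import product

def settled_path(js):
    """Apply the rand(i, n-1) algorithm to [1, 2, 3] with fixed rand results."""
    arr = [1, 2, 3]
    for i, j in enumerate(js):
        arr[i], arr[j] = arr[j], arr[i]
    return tuple(arr)

# j is drawn from [i, 2] at step i, so there are 3 * 2 * 1 = 6 paths.
paths = product(range(0, 3), range(1, 3), range(2, 3))
buckets = Counter(settled_path(js) for js in paths)
print(sorted(buckets.items()))  # each of the 6 orderings exactly once
```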

4) if we try to calculate the probability of "1" ending up at a[0] in the first algorithm, the calculation is a bit long, but as the illustration in lassevk's answer shows, it is 9/27 = 1/3; however, "2" ends up at a[1] with probability 8/27, and "3" ends up at a[2] with probability 9/27 = 1/3.

as a result, "2" ending up at a[1] does not have probability 1/3, so the algorithm produces a pretty skewed result (an error of about 3.7 percentage points, as opposed to something negligible like 3/10000000000000 = 0.00000000003%)

5) the proof that Joel Coehoorn gives can indeed show that some cases will be over-represented. i think the explanation of why there are n^n cases is this: at each iteration the random number can take n values, so after n iterations there are n^n = 27 possible paths. 27 is not evenly divisible by the number of permutations (n! = 3! = 6) in the case n = 3, so some results must be over-represented. they are over-represented in the sense that they show up 5 times instead of 4, so if you shuffle the 3 cards 27 million times starting from the same initial order, the over-represented patterns will show up about 5 million times each as opposed to 4 million times, which is quite a big difference.

6) i think the over-representation has been shown, but "why" does the over-representation happen?

7) an ultimate test for the algorithm to be correct is that every card has a 1/n probability of ending up at each slot.

動靜能量
A: 

The Naive algorithm picks the values of n like so:

n = rand(3)

n = rand(3)

n = rand(3)

3^3 possible combinations of n

1,1,1  1,1,2 .... 3,3,2  3,3,3 (27 combinations). lassevk's answer shows how these combinations are distributed among the resulting card orders.

the better algorithm does:

n = rand(3)

n = rand(2)

n! possible combinations of n

1,1  1,2  2,1  2,2  3,1  3,2 (6 combinations, each giving a different result)

As mentioned in the other answers, if you take 27 attempts to get 6 results, you cannot possibly attain the 6 results with an even distribution, since 27 is not divisible by 6. Put 27 marbles into 6 buckets and, no matter what you do, some buckets will have more marbles than others; the best you can do is 4, 4, 4, 5, 5, 5 marbles for buckets 1 through 6.

the fundamental problem with the naive shuffle is that it swaps too many times. to shuffle 3 cards completely, you need only do 2 swaps, and the second swap need only involve the remaining two cards, since the 3rd card already had a 1/3 chance of being swapped. continuing to swap cards imparts more chances that a given card will be swapped, and those chances will only even out to 1/3, 1/3, 1/3 if the total number of swap combinations is divisible by 6.

Atilio Jobson