Just as background, I'm aware of the Fisher-Yates perfect shuffle. It is a great shuffle with its O(n) complexity and its guaranteed uniformity and I'd be a fool not to use it ... in an environment that permits in-place updates of arrays (so in most, if not all, imperative programming environments).

Sadly the functional programming world doesn't give you access to mutable state.

Because of Fisher-Yates, however, there's not a lot of literature I can find on how to design a shuffling algorithm. The few places that address it at all do so briefly before saying, in effect, "so here's Fisher-Yates which is all the shuffling you need to know". I had to, in the end, come up with my own solution.

The solution I came up with works like this to shuffle any list of data:

  • If the list is empty, return the empty list.
  • If the list has a single item, return that single-item list unchanged.
  • If the list has more than one item, partition it with a random number generator and apply the algorithm recursively to each partition, assembling the results.

In Erlang code it looks something like this:

shuffle([])  -> [];
shuffle([L]) -> [L];
shuffle(L)   ->
  {Left, Right} = lists:partition(fun(_) -> 
                                    random:uniform() < 0.5 
                                  end, L),
  shuffle(Left) ++ shuffle(Right).

(If this looks like a deranged quick sort to you, well, that's what it is, basically.)

So here's my problem: the same situation that makes finding shuffling algorithms that aren't Fisher-Yates difficult makes finding tools to analyse a shuffling algorithm equally difficult. There's lots of literature I can find on analysing PRNGs for uniformity, periodicity, etc. but not a lot of information out there on how to analyse a shuffle. (Indeed, some of the information I found on analysing shuffles was just plain wrong -- the suggested checks were easily defeated by simple techniques.)

So my question is this: how do I analyse my shuffling algorithm (assuming that the random:uniform() call up there is up to the task of generating appropriate random numbers with good characteristics)? What mathematical tools are at my disposal to judge whether or not, say, 100,000 runs of the shuffler over a list of integers ranging over 1..100 has given me plausibly good shuffling results? I've done a few tests of my own (comparing increments to decrements in the shuffles, for example), but I'd like to know a few more.

And if there's any insight into that shuffle algorithm itself that would be appreciated too.

A: 

The answers to this question might help:

http://stackoverflow.com/questions/1685339/verify-knuth-shuffle-algorithm-is-as-unbiased-as-possible

It'd also be worth taking a look at Knuth's analysis of Fisher-Yates (see the Wikipedia article you linked for a citation).

Alex Mendes da Costa
+5  A: 

Your algorithm is a sort-based shuffle, as discussed in the Wikipedia article.

Generally speaking, the computational complexity of a sort-based shuffle is the same as that of the underlying sort algorithm (e.g. O(n log n) average, O(n²) worst case for a quicksort-based shuffle), and while the distribution is not perfectly uniform, it should be close enough to uniform for most practical purposes.

Oleg Kiselyov's article on purely functional random shuffling covers the limitations of sort-based shuffles in more detail, and also offers two adaptations of the Fisher–Yates strategy: a naive O(n²) one, and a binary-tree-based O(n log n) one.

Sadly the functional programming world doesn't give you access to mutable state.

This is a complete myth: functional programming avoids side effects, but supports mutable state perfectly well, without invoking side effects.

In this particular case, you can use Haskell's mutable arrays (for example, STArrays in the ST monad) to implement the Fisher–Yates algorithm directly.
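
As a rough illustration (a sketch of my own; the function name fisherYates and the use of an STRef to thread the generator are illustrative choices, not code from this answer), a minimal version could look like this:

import Control.Monad (forM_)
import Data.Array (elems)
import Data.Array.ST (newListArray, readArray, runSTArray, writeArray)
import Data.STRef (newSTRef, readSTRef, writeSTRef)
import System.Random (RandomGen, mkStdGen, randomR)

-- Fisher-Yates on a mutable STArray; the mutation stays local to runSTArray,
-- so the function is pure from the outside.
fisherYates :: RandomGen g => g -> [a] -> [a]
fisherYates gen xs = elems $ runSTArray $ do
    let n = length xs
    arr  <- newListArray (0, n - 1) xs
    gRef <- newSTRef gen
    forM_ [n - 1, n - 2 .. 1] $ \i -> do
        g <- readSTRef gRef
        let (j, g') = randomR (0, i) g   -- swap index chosen uniformly from 0..i
        writeSTRef gRef g'
        vi <- readArray arr i            -- swap positions i and j
        vj <- readArray arr j
        writeArray arr i vj
        writeArray arr j vi
    return arr

For example, fisherYates (mkStdGen 42) [1..10 :: Int] returns a permutation of [1..10]; as with any Fisher–Yates implementation, each permutation is equally likely if the generator is good.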

Addendum

The specific foundation of your shuffle sort is actually an infinite-key radix sort: as gasche points out, each partition corresponds to a digit grouping.

The main disadvantage of this is the same as any other infinite-key sorting shuffle: there is no termination guarantee. Although the likelihood of termination increases as the comparison proceeds, there is never an upper bound: the worst-case complexity is O(∞).

Piet Delport
Sorry, I was less than precise. *Efficient* access to mutable state. ;)
JUST MY correct OPINION
What makes you think it's not efficient?
Piet Delport
It is possible and easy to fix a simple sort-based shuffle to be *perfectly* uniform (provided the underlying random number generator is also perfectly uniform), without going through the additional complexity of Oleg's pure solution. You lose uniformity when two elements compare equal during the sorting: an arbitrary choice has to be made to order them. You can pick weights that are guaranteed never to be equal, such as randomly-picked real numbers (floats, or even better, lazy boolean streams). Cf [haskell-beginners](http://www.haskell.org/pipermail/beginners/2010-August/005165.html)
gasche
gasche: That is still *close* to uniform, not perfectly uniform. There are four problems to overcome: (1) With any selection from a finite key space, duplicates are unavoidable, by definition. (2) If you make the key space infinite, as with lazy boolean streams, your algorithm is no longer guaranteed to terminate. (3) If you discard and re-pick duplicates, you introduce bias, and your keys are no longer uniform. (4) As Oleg points out, even if you could solve the previous three problems, you would still have to prove that the configurational space of key assignments is exactly divisible by N!.
Piet Delport
The algorithm with lazy boolean streams terminates with probability 1. This is not the same thing as "always terminates"; in particular you're vulnerable to an evil random number generator (one that always outputs "1", for example), but it is nevertheless a quite strong guarantee. Re. N!: of course, if you uniformly pick N *distinct* weights, the configurational space of their orderings is of size N!.
gasche
gasche: "approaches 1" is not the same as "equals 1". If the random numbers are unbiased, there will always be an execution path that never terminates, by definition. You can calculate the *likelihood* that the comparison terminates after N steps, but you cannot guarantee anything: there will always be a non-zero chance of non-termination.
Piet Delport
Also, no, the configurational space is *not* necessarily N!. The number of final *orderings* might be N!, but each final ordering corresponds to multiple possible input configurations, and unless *those* are exactly balanced, some final orderings will be slightly more likely than others. This is what Oleg was getting at.
Piet Delport
What about doing steps 1-3 without worrying about distinct sort keys, but then searching for any runs of consecutive sort keys and recursively shuffling those?
supercat
supercat: Same problem: there is no guarantee that the recursion will ever terminate. You can only calculate the likelihood that it terminates after N steps, which approaches (but never reaches) 1. In addition, the resulting shuffle still won't be uniform unless you solve the other problems.
Piet Delport
"You can only calculate the likelihood that it terminatesafter N steps, which approaches (but never reaches) 1"My claim is that the algorithm *terminates* with probability 1 : with probability 1, there exist a N such that the algorithm returns its result after N step. Why ? Because the other possibility is that *for all N*, the algorithm isn't finished after N steps. But that has probability 0.
gasche
gasche: That is incorrect. It is *always* possible for two key bit choices to both be 1 or both be 0, requiring another iteration: any attempt to change this must introduce bias. If you make a guarantee that the bits will differ after N steps, you can guarantee termination, but only at the cost of no longer guaranteeing uniformity. In general, it's impossible for any algorithm based on this approach to provide both uniformity and termination at the same time: each one mutually excludes the other, so a choice must be made.
Piet Delport
I'm talking about *probabilistic termination*. Saying that an event has probability 0 is *not* the same thing as saying that it does not exist. You're right that, for some sequences of bit choices, the algorithm does not terminate. But the set of such sequences, in the set of all sequences, has [measure](http://en.wikipedia.org/wiki/Lebesgue_measure) 0. In another context: if you uniformly pick a random real number (as a stream of booleans), the probability that this number is rational is 0. Similarly, if you pick two real numbers, the probability that they're equal *is* 0.
gasche
I added a section about the probabilistic termination condition in my answer.
gasche
gasche: Be careful about confusing measures and probabilities: they are not the same. The measure is 0 because its definition is rooted in infinite sets. Termination, on the other hand, is *finite*, by definition: the probability of not terminating after N steps is *always* non-zero for any N, and this has nothing to do with the measure.
Piet Delport
I have the impression we're talking past each other. I claim that, for every input array T, the event (there exists an N such that the algorithm has terminated after N steps on input T) has probability exactly 1. I don't exactly understand what you mean and where our disagreement lies, but I understand your claim to be that, for any input T, (there does *not* exist an N such that, with probability 1, the algorithm has terminated after N steps on input T). Both claims are true, but I think only the first one means that "(the algorithm terminates) with probability 1".
gasche
gasche: *"there exists N such that the algorithm has terminated after N steps on input T"* is obviously false, with the counterexample for any N being the sequences of length N+1 differing only in the last digit. This is a very simple inductive proof: for all N, you can provide a non-terminating input sequence.
Piet Delport
More broadly, the problem seems to be that you are using the word "probability" informally: you are talking about claims of *existence*, which (regardless of their truth) are orthogonal to mathematical *probability*, which is defined in terms of *events*. Termination *is* an event, and it is easy to prove that for all N, any bitwise comparison of length N yields (4^N - 2^N) possibilities that *do* terminate, and a remaining (2^N) possibilities that *don't* terminate: the probability of non-termination, in other words, is 2^-N, which is always greater than zero.
Piet Delport
gasche: A different and more general formal proof of non-termination in termination analysis is by observing that the loop does not have a [variant](http://en.wikipedia.org/wiki/Loop_variant) function. The loop state (i.e., the keys being compared) forms an [infinite descending chain](http://en.wikipedia.org/wiki/Infinite_descending_chain): this means that it fails the descending chain condition of terminating loops.
Piet Delport
Ok. There is obviously something you don't understand about what the probability of a "there exists N ..."-style formula is. As my repeated attempts at explanation didn't work, I'm left with the argument from authority. I tried to find relevant work in the scientific literature, but have found no clear and simple paper explaining exactly this (probably too simple). Please, however, read *Example 1* (pages 1 and 2) of the following paper: [A formal approach to probabilistic termination](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.6208).
gasche
I quote here the relevant paragraph: "[...] will return the natural number n with probability 2^{-(n+1)}. A simple way to implement this is to return the index of the first coin-flip in the sequence that is ‘heads’. However, this is not guaranteed to terminate on every possible input sequence of coin-flips, the counter-example being the ‘all-tails’ sequence. However, the algorithm does satisfy probabilistic termination, meaning that the probability that it terminates is 1 (“in all practical situations it must terminate”)."
gasche
gasche: Did you actually read that paper? It is correct, and your quote repeats my explanation that this class of algorithm is **not** guaranteed to terminate. The paper itself introduces the new, analogous notion of *probabilistic* termination, which is formally distinct from and strictly weaker than normal termination: an algorithm that only satisfies *probabilistic* termination is proven to become more likely to terminate with time, but specifically does **not** provide guaranteed termination.
Piet Delport
If you're only skimming, refer to the first half of the paper's conclusion, where the author discusses the logical distinction between probabilistic termination and guaranteed termination, and how programs with probabilistic termination are still unbounded. (In other words: even though they are in a formal sense "less unbounded" than diverging loops, they are still **not** bounded or fully terminating.)
Piet Delport
I completely agree that probabilistic termination is not the same thing as non-probabilistic (which you call "guaranteed") termination. I've actually said that numerous times during our discussion, in my second comment to this thread ("The algorithm with lazy boolean streams terminates with probability 1. This is not the same thing as "always terminates", [..]"), and in the "Probabilistic termination" section I added to my own answer later. I've always been talking about probabilistic termination, and claimed that the probability of termination is *exactly* 1 (which you seemed to disagree with).
gasche
I'm sorry if I sound picky or aggressive in my replies. I know you're a reasonable person, and I'm simply frustrated that you don't seem to understand what I say, despite my successive attempts to achieve formal precision on the points that seemed controversial.
gasche
gasche: Thank you for your courtesy; I do appreciate it! In reply: it's not so much me but the author of the paper who calls it "guaranteed" termination, and then only in order to distinguish it from "probabilistic termination". Everybody else simply calls the former "termination", to the extent that that's the name of the field. The confusion seems to have come from the author saying "probability that it terminates is 1" when they actually meant "probability that it *probabilistically*-terminates is 1", using their modified, non-finite (and nonstandard) definition of "terminate".
Piet Delport
I think that when you say "terminates with probability 1", it should be clear that you're talking about probabilistic termination. That's what I've been doing all along. And I don't think this definition is really non-standard: it's certainly less often used than the usual termination, because few algorithms have random variables in their running times, but as soon as you encounter one such algorithm the definition is pretty natural: what is the measure of terminating runs in the space of all possible runs? My own intuition coincided perfectly with the article's definition.
gasche
+3  A: 

General remark

My personal approach to the correctness of probability-using algorithms: if you know how to prove it's correct, then it's probably correct; if you don't, it's certainly wrong.

Said differently, it's generally hopeless to try to analyse every algorithm you could come up with: you have to keep looking for an algorithm until you find one that you can prove correct.

Analysing a random algorithm by computing the distribution

I know of one way to "automatically" analyse a shuffle (or, more generally, a random-using algorithm) that is stronger than the simple "throw lots of tests at it and check for uniformity". You can mechanically compute the distribution associated with each input of your algorithm.

The general idea is that a random-using algorithm explores a part of a world of possibilities. Each time your algorithm asks for a randomly chosen element of a set ({true, false} when flipping a coin), there are two possible outcomes for your algorithm, and one of them is chosen. You can change your algorithm so that, instead of returning one of the possible outcomes, it explores all solutions in parallel and returns all possible outcomes with the associated distributions.

In general, that would require rewriting your algorithm in depth. If your language supports delimited continuations, you don't have to: you can implement that "exploration of all possible outcomes" inside the function asking for a random element (the idea is that the random generator, instead of returning a result, captures the continuation associated with your program and runs it with all the different results). For an example of this approach, see Oleg's HANSEI.

An intermediate, and probably less arcane, solution is to represent this "world of possible outcomes" as a monad, and to use a language such as Haskell with facilities for monadic programming.
Here is an example implementation of a variant¹ of your algorithm, in Haskell, using the probability monad of the probability package:

import Numeric.Probability.Distribution

shuffleM :: (Num prob, Fractional prob) => [a] -> T prob [a]
shuffleM [] = return []
shuffleM [x] = return [x]
shuffleM (pivot:li) = do
        (left, right) <- partition li
        sleft <- shuffleM left
        sright <- shuffleM right
        return (sleft ++ [pivot] ++ sright)
  where partition [] = return ([], [])
        partition (x:xs) = do
                  (left, right) <- partition xs
                  uniform [(x:left, right), (left, x:right)]

You can run it for a given input, and get the output distribution:

*Main> shuffleM [1,2]
fromFreqs [([1,2],0.5),([2,1],0.5)]
*Main> shuffleM [1,2,3]
fromFreqs
  [([2,1,3],0.25),([3,1,2],0.25),([1,2,3],0.125),
   ([1,3,2],0.125),([2,3,1],0.125),([3,2,1],0.125)]

You can see that this algorithm is uniform on inputs of size 2, but non-uniform on inputs of size 3.
The difference with the test-based approach is that we can gain absolute certainty in a finite number of steps: the number of steps can be quite big, as it amounts to an exhaustive exploration of the world of possibilities (but generally smaller than 2^N, as there are factorisations of similar outcomes), but if it returns a non-uniform distribution we know for sure that the algorithm is wrong. Of course, if it returns a uniform distribution for [1..N] and 1 <= N <= 100, you only know that your algorithm is uniform up to lists of size 100; it may still be wrong on bigger lists.

¹: this algorithm is a variant of your Erlang implementation, because of the specific pivot handling. If I use no pivot, as in your case, the input size no longer decreases at each step: the algorithm also considers the case where all inputs end up in the left list (or the right list), and gets lost in an infinite loop. This is a weakness of the probability monad implementation (if an algorithm has probability 0 of non-termination, the distribution computation may still diverge), which I don't yet know how to fix.

Sort-based shuffles

Here is a simple algorithm that I feel confident I could prove correct:

  1. pick a random number (a weight) for each element in your collection
  2. if the random weights are not all distinct, restart step 1.
  3. "sort" your collection, using the comparison on the random weights associated to each element

You can skip step 2 if you know the probability of a "conflict" (two picked random numbers being equal) is low enough, but unless it's exactly 0 your shuffle isn't uniform anymore.

If you pick your weights in [1..N], where N is the length of your collection, you'll have lots of conflicts (birthday problem). If you pick your weights in [min_int..max_int] or as random floats, the probability of conflict is low in practice (but still subject to the birthday problem; see Piet Delport's comment).
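
For concreteness, here is a hedged Haskell sketch of steps 1-3 with finite integer weights (the function name weightShuffle and the retry-on-collision strategy are my own illustration, not code from this answer):

import Data.List (nub, sortOn)
import System.Random (RandomGen, mkStdGen, randoms, split)

-- Tag every element with a random Int weight, restart from step 1 if any
-- two weights collide, then sort by weight and drop the weights.
weightShuffle :: RandomGen g => g -> [a] -> [a]
weightShuffle gen xs
  | length (nub ws) == length ws = map snd (sortOn fst (zip ws xs))
  | otherwise                    = weightShuffle gen' xs   -- conflict: retry
  where
    (g, gen') = split gen
    ws        = take (length xs) (randoms g) :: [Int]

For example, weightShuffle (mkStdGen 7) "abcdef" returns a permutation of "abcdef"; the retry loop rarely fires for short lists, but (as the comments below point out) collisions become likely for very long ones.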

If you pick real numbers (as random infinite lazy lists of booleans) as weights, the probability of conflict is 0 in theory, and you don't need to check for distinctness.

Here is a shuffle implementation in OCaml, using lazy real numbers:

type 'a stream = Cons of 'a * 'a stream lazy_t

let rec real_number () =
  Cons (Random.bool (), lazy (real_number ()))

let rec compare_real a b = match a, b with
| Cons (true, _), Cons (false, _) -> 1
| Cons (false, _), Cons (true, _) -> -1
| Cons (_, lazy a'), Cons (_, lazy b') ->
    compare_real a' b'

let shuffle list =
  List.map snd
    (List.sort (fun (ra, _) (rb, _) -> compare_real ra rb)
       (List.map (fun x -> real_number (), x) list))

There are other approaches to "pure shuffling". A nice one is apfelmus's mergesort-based solution.

Algorithmic considerations: the complexity of the previous algorithm depends on the probability that all picked weights are distinct. If you pick them in [min_int..max_int], you have a 2^{-N} probability of conflict, which means an average number of weight choices of 1, and an O(n log n) algorithm (the only complexity comes from the sorting), assuming picking a random number is O(1).

If you pick random boolean streams, you never have to restart picking, but the complexity is then related to how many elements of the streams are evaluated on average. I conjecture it is O(log n) on average (hence still O(n log n) in total), but I have no proof.

... and I think your algorithm works

After more reflection, I think (like douplep) that your implementation is correct. Here is an informal explanation.

Each element in your list is tested by several random:uniform() < 0.5 tests. To each element, you can associate the list of outcomes of those tests, as a list of booleans or of {0, 1} digits. At the beginning of the algorithm, you don't know the list associated with any of the elements. After the first partition call, you know the first element of each list, etc. When your algorithm returns, the lists of tests are completely known, and the elements are sorted according to those lists (sorted in lexicographic order, or considered as binary representations of real numbers).

In essence, your algorithm is equivalent to my "sort over real numbers, as lazy lists of booleans". The action of partitioning the list, reminiscent of quicksort's partition around a pivot element, is actually a way of separating, for a given position in the binary decomposition of those numbers, the elements whose digit is 0 from the elements whose digit is 1.

The sort is uniform because the real numbers picked are all different. Indeed, two elements whose real numbers are equal up to the n-th digit end up on the same side of a partition occurring during a recursive shuffle call of depth n. The algorithm only terminates when all the lists resulting from partitions are empty or singletons: all elements have been separated by at least one test, and therefore differ in at least one binary digit.
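
To make the correspondence concrete, here is a hedged Haskell analogue of the OCaml code above (the name lazyBitShuffle and the generator-splitting scheme are my own): each element is keyed by an infinite lazy stream of random bits, and comparing two keys forces only as many bits as are needed to tell them apart, exactly like the recursive partitioning in the Erlang shuffle.

import Data.List (sortOn)
import System.Random (StdGen, randoms, split)

-- Sort on infinite lazy streams of random bits; laziness means only the
-- bits needed to distinguish two elements are ever generated.
lazyBitShuffle :: StdGen -> [a] -> [a]
lazyBitShuffle gen xs = map snd (sortOn fst (zip keys xs))
  where
    gens = take (length xs) (iterate (snd . split) gen)   -- one generator per element
    keys = [ randoms (fst (split g)) :: [Bool] | g <- gens ]

As with the OCaml version, termination is only probabilistic: two keys that agree on arbitrarily many leading bits would force arbitrarily many comparisons, which is exactly the situation discussed in the next section.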

Probabilistic termination

A subtle point about your algorithm (or my equivalent sort-based method) is that the termination condition is probabilistic. Fisher-Yates always terminates after a known number of steps (the number of elements in the array). With your algorithm, termination depends on the output of the random number generator.

There are possible outputs that would make your algorithm diverge rather than terminate. For example, if the random number generator always outputs 0, each partition call will return the input list unchanged, on which you recursively call the shuffle: you will loop indefinitely.

However, this is not an issue if you're confident that your random number generator is fair: it does not cheat and always returns independent, uniformly distributed results. In that case, the probability that the test random:uniform() < 0.5 always returns true (or always false) is exactly 0:

  • the probability that the first N calls return true is 2^{-N}
  • the probability that all calls return true is the probability of the infinite intersection, over all N, of the events that the first N calls return true; it is the limit (infimum)¹ of the 2^{-N}, which is 0

¹: for the mathematical details, see http://en.wikipedia.org/wiki/Measure_(mathematics)#Measures_of_infinite_intersections_of_measurable_sets
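
Restating that argument in symbols (notation mine): write A_N for the event that the first N calls return true; since A_1 ⊇ A_2 ⊇ ..., continuity of measure from above gives

P\Big(\bigcap_{N \ge 1} A_N\Big) = \lim_{N \to \infty} P(A_N) = \lim_{N \to \infty} 2^{-N} = 0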

More generally, the algorithm does not terminate if and only if some of the elements get associated with the same boolean stream, which means that at least two elements have the same boolean stream. But the probability that two random boolean streams are equal is again 0: the probability that the digits at position K are equal is 1/2, so the probability that the first N digits are equal is 2^{-N}, and the same analysis applies.

Therefore, you know that your algorithm terminates with probability 1. This is a slightly weaker guarantee than that of the Fisher-Yates algorithm, which always terminates. In particular, you're vulnerable to an attack by an evil adversary that controls your random number generator.

With more probability theory, you could also compute the distribution of running times of your algorithm for a given input length. This is beyond my technical abilities, but I assume it's good: I suppose that you only need to look at the first O(log N) digits on average to check that all N lazy streams are different, and that the probability of much higher running times decreases exponentially.

gasche
My real question here, though, is what empirical tests can I throw at the output of my shuffler to see if it is plausibly shuffled? For example, that "pair a random weight with each element" approach tested badly even with my limited ability to test this stuff. (I tested the sequence [1,2] repeatedly and found a huge imbalance.)
JUST MY correct OPINION
`[min_int..max_int]` is not enough to bring the probability of conflict close to 0, because of the birthday problem you mentioned: with 32-bit ints, you already reach a 0.5 chance of conflict with a list of only ~77,000 items.
Piet Delport
Also, note that in general, making any sort-based shuffle perfectly uniform/correct is probably much harder than it seems at first: for some of the problems, see Oleg's writeup, and my answer's comments. If a perfect shuffle is important at all, it is certainly much easier and simpler to just use the Fisher–Yates algorithm.
Piet Delport
I edited to mention your warning about `[min_int..max_int]`: you're right, and it doesn't scale to big sequences. I also included an implementation of the real-number-based sort. I agree that Fisher-Yates is simpler, but I'm not sure Oleg's proposal is.
gasche
gasche: Oleg's proposal is just one way of expressing Fisher–Yates in Haskell, using immutable binary trees instead of mutable arrays for the extraction step (and you shouldn't count the binary tree implementation, which is generic, towards the algorithm itself). You can implement the basic algorithm using a number of other array-like backing structures, including Haskell's built-in trees, mutable arrays, or DiffArrays (or even more exotic structures like zippers).
Piet Delport