I've been killing some time with a little puzzle project -- writing the algorithm described in Knuth for random sampling of a population without knowing the size of the population in advance. So far I've written it in JavaScript Rhino, Clojure, Groovy, Python, and Haskell. The basic parameters are:
- the function takes two arguments, a sample size and the population
- the algorithm should work on the language's basic representation of an iterable item -- e.g. seqs for Clojure, lazy lists for Haskell, iterators for Python, etc.
- the algorithm should work on a stream of values, and not require any more than O(sample-size) memory.
- the algorithm should not do any unnecessary comparisons
Would anyone like to add their own versions, or fix bugs or poor expression in my versions?
Haskell 1:
import Data.Array
import Random
import Monad
randomSample 0 _ = return []
randomSample sampleSize items =
(liftM elems) (foldM addItem
(listArray (0,(length sample) - 1) sample)
(zip [sampleSize..] others))
where (sample, others) = splitAt sampleSize items
addItem current item =
do index <- getStdRandom (randomR (0,fst item))
return (if index < sampleSize
then current // [(index, (snd item))]
else current)
Haskell 2 (looks more Haskellish to me):
import Data.Array
import Random
import Monad
randomSample 0 _ = return []
randomSample sampleSize items =
liftM elems $ foldM addItem (reservoir $ take sampleSize items) others
where reservoir x = listArray (0,length x - 1) x
others = zip [sampleSize..] $ drop sampleSize items
addItem current item =
do index <- getStdRandom $ randomR (0, fst item)
return $! if index < sampleSize
then current // [(index, snd item)]
else current