views:

1231

answers:

9

I've been killing some time with a little puzzle project -- writing the algorithm described in Knuth for random sampling of a population without knowing the size of the population in advance. So far I've written it in JavaScript Rhino, Clojure, Groovy, Python, and Haskell. The basic parameters are:

  • the function takes two arguments, a sample size and the population
  • the algorithm should work on the language's basic representation of an iterable item -- e.g. seqs for Clojure, lazy lists for Haskell, iterators for Python, etc.
  • the algorithm should work on a stream of values, and not require any more than O(sample-size) memory.
  • the algorithm should not do any unnecessary comparisons

Would anyone like to add their own versions, or fix bugs or poor expression in my versions?

Haskell 1:

import Data.Array
import Random
import Monad

randomSample 0 _ = return []
randomSample sampleSize items = 
    (liftM elems) (foldM addItem
                   (listArray (0,(length sample) - 1) sample)
                   (zip [sampleSize..] others))
    where (sample, others) = splitAt sampleSize items
          addItem current item =
              do index <- getStdRandom (randomR (0,fst item))
                 return (if index < sampleSize 
                         then current // [(index, (snd item))]
                         else current)

Haskell 2 (looks more Haskellish to me):

import Data.Array
import Random
import Monad

randomSample 0 _ = return []
randomSample sampleSize items = 
    liftM elems $ foldM addItem (reservoir $ take sampleSize items) others
    where reservoir x = listArray (0,length x - 1) x
          others = zip [sampleSize..] $ drop sampleSize items
          addItem current item =
              do index <- getStdRandom $ randomR (0, fst item)
                 return $! if index < sampleSize 
                           then current // [(index, snd item)]
                           else current
+2  A: 

Lua

function randomsample (samplesize, items)
    local sample = {}
    if samplesize == 0 then return sample end
    for num,item in ipairs(items) do
        if num <= samplesize then
            sample[num] = items[num]
        else
            local index = math.random(num)
            if index <= samplesize then sample[index] = item end
        end
    end
    return sample
end
Doug Currie
I like that one. So many languages to learn, so little time.
Steven Huwig
+1  A: 

Ocaml, w/ streams:

(* [reservoir_sampling ?seed stream length] Creates an array of an unknown
 * number of values from [stream] of size [length].
 *
 * Caveat: if the stream is previously read it will return an incorrect
 * value of Stream.count that is used for the seen values. Change the code
 * to using a reference if this is a problem *)
val reservoir_sampling : ?seed:int -> 'a Stream.t -> int -> 'a array

let reservoir_sampling ?seed stream length =
    let () = match seed with | Some x -> Random.init x
                             | None -> Random.self_init ()
    and ray = Array.init length (fun _ -> Stream.next stream) in
    Stream.iter 
        (fun x ->
            let ran = Random.int (Stream.count stream) in
            if ran < length then Array.set ray ran x;
        ) stream;
    ray
nlucaroni
+2  A: 

The Haskell version is not very efficient. It relies on replacing the n-th item in the reservoir list by index, which is O(n).

A better solution is to use the array accumulation function "accum", which takes an accumulation function, an existing array, and a list of (index,value) pairs. The accumulation function in this case is (flip const), which always returns its second argument. So the reservoir is defined as the first n items in the list and the rest of the list xs1 is zipped with a stream of random numbers into the select function. Each item from xs1 is passed to accum if the corresponding random number is less than n. The stream of random numbers, rs, starts at a range [0,n] inclusive and the upper bound is then incremented for each successive number.

Also this version takes a random generator g instead of using the IO monad generator.

module Reservoir where

import Data.Array
import Data.Maybe
import System.Random

reservoirArray :: StdGen -> Int -> [a] -> Array Int a
reservoirArray g n xs =
    if null xs1 
       then listArray (0, length res - 1) res
       else accum (flip const) reservoir $ catMaybes $ 
            zipWith select (rs n g) xs1
    where
      rs k g1 = let (v, g2) = randomR (0, k) g1 in v : rs (k+1) g2
      (res, xs1) = splitAt n xs
      reservoir = listArray (0, n-1) res
      select r x = if r < n then Just (r, x) else Nothing


reservoirList :: StdGen -> Int -> [a] -> [a]
reservoirList g n xs = elems $ reservoirArray g n xs
Paul Johnson
Thank you very much for your rewrite. I was under the impression that the reservoir was an array and hence replacement would be O(1). The "listArray" call in the definition of reservoir was supposed to do that.I definitely found that the random number generator was very slow in my implementation.
Steven Huwig
`flip const == const id`, and the latter is shorter ;-) More seriously, you should use `Random g => g` instead of `StdGen` and return the final random state, or switch to using `Control.Monad.Random`.
ephemient
A: 

Groovy:

def randomSample(sampleSize, items) {
  def sample = []
  if (sampleSize == 0)
    return sample
  items = items.iterator()
  def count = 0;
  while (count++ < sampleSize && items.hasNext()) {
    sample.push(items.next())
  }
  items.eachWithIndex { item, i -> 
    def index = new Random().nextInt(i + sampleSize)
    if (index < sampleSize)
      sample[index] = item
  }
  return sample;
}
Steven Huwig
A: 

Python:

from random import randint

def random_sample(sample_size, items):
    if sample_size == 0:
        return []
    sample = []
    for num, item in enumerate(items):
        if num < sample_size:
            sample.append(item)
        else:
            try:
                sample[randint(0, num)] = item
            except IndexError:
                pass
    return sample
Steven Huwig
@Beni Cherniavsky-Paskin: thanks for the edit, but this new implementation compares num to sample_size on every iteration, which is unnecessary.
Steven Huwig
A: 

JavaScript Rhino:

function randomSample(sampleSize, items) {
    var sample = [];
    if (sampleSize == 0)
        return sample;
    items = Iterator(items);
    for each (let [num, item] in items) {
        if (num >= sampleSize)
            break;
        sample.push(item);
    }
    for each (let [num, item] in items) {
        var index = Math.floor(Math.random() * num);
        if (index < sampleSize)
            sample[index] = item;
    }
    return sample;
}
Steven Huwig
A: 

Clojure:

(defn random-sample [sample-size items]
  (if (= sample-size 0) []
      (let [[sample remaining-items] (split-at sample-size items)]
        (second (reduce (fn [[num sample] item]
                          (let [index (rand-int num)]
                            (if (< index sample-size)
                              [(inc num) (assoc sample index item)]
                              [(inc num) sample])))
                        [sample-size (vec sample)]
                        remaining-items)))))
Steven Huwig
A: 

PLT Scheme with lots of streams:

(require srfi/27)
(require srfi/41)

(define (random-sample size input)
  (let ((first-part (stream-take size input))
        (second-part (stream-drop size input))
        (pool (make-vector size)))
    (stream-for-each (match-lambda ((list i val)
                                    (vector-set! pool i val)))
                     (stream-zip (stream-from 0) first-part))
    (stream-for-each (match-lambda ((list i val)
                                    (let ((random-index (random-integer i)))
                                      (when (< random-index size)
                                        (vector-set! pool random-index val)))))
                     (stream-zip (stream-from size) second-part))
    (vector->list pool)))
Steven Huwig
+1  A: 

Note that in your solution, there is a big bias for the order of the items, in the first items - since the first n items are already placed in a specific location.

A 'clean' random.sample may, for instance, return the number 2 when sampling 5 elements out of 1-100, in any of the five places. However, this solution may only return in in the second place. Although the latest items will be returned in any location, the first ones are causing the bias. So although a set-test (just checking the elements returned) will be legit, the order is not very random. Perhaps re-shuffling the sample before returning, would be better (another flag-parameter?), or it can be inserted into the beginning of the function, in a more complex way..

S.

s.kap
Since samples are sets and sets are unordered, I didn't really care about the order. In fact, my typical use case is to resort them according to the original order in the file, since that's the only way that join(1) can be used to select records in other files.
Steven Huwig
Besides this is really Waterman's solution, not mine. ;)
Steven Huwig