ansaurus

Question

Rosetta Stone: reservoir random sampling algorithm

Answer 1

+2 A:

Lua

function randomsample (samplesize, items)
    local sample = {}
    if samplesize == 0 then return sample end
    for num,item in ipairs(items) do
        if num <= samplesize then
            sample[num] = items[num]
        else
            local index = math.random(num)
            if index <= samplesize then sample[index] = item end
        end
    end
    return sample
end

Doug Currie 2008-11-26 19:02:24

I like that one. So many languages to learn, so little time.

Steven Huwig 2008-11-26 19:08:48

Answer 2

+1 A:

Ocaml, w/ streams:

(* [reservoir_sampling ?seed stream length] Creates an array of an unknown
 * number of values from [stream] of size [length].
 *
 * Caveat: if the stream is previously read it will return an incorrect
 * value of Stream.count that is used for the seen values. Change the code
 * to using a reference if this is a problem *)
val reservoir_sampling : ?seed:int -> 'a Stream.t -> int -> 'a array

let reservoir_sampling ?seed stream length =
    let () = match seed with | Some x -> Random.init x
                             | None -> Random.self_init ()
    and ray = Array.init length (fun _ -> Stream.next stream) in
    Stream.iter 
        (fun x ->
            let ran = Random.int (Stream.count stream) in
            if ran < length then Array.set ray ran x;
        ) stream;
    ray

nlucaroni 2008-12-01 18:58:21

Answer 3

+2 A:

The Haskell version is not very efficient. It relies on replacing the n-th item in the reservoir list by index, which is O(n).

A better solution is to use the array accumulation function "accum", which takes an accumulation function, an existing array, and a list of (index,value) pairs. The accumulation function in this case is (flip const), which always returns its second argument. So the reservoir is defined as the first n items in the list and the rest of the list xs1 is zipped with a stream of random numbers into the select function. Each item from xs1 is passed to accum if the corresponding random number is less than n. The stream of random numbers, rs, starts at a range [0,n] inclusive and the upper bound is then incremented for each successive number.

Also this version takes a random generator g instead of using the IO monad generator.

module Reservoir where

import Data.Array
import Data.Maybe
import System.Random

reservoirArray :: StdGen -> Int -> [a] -> Array Int a
reservoirArray g n xs =
    if null xs1 
       then listArray (0, length res - 1) res
       else accum (flip const) reservoir $ catMaybes $ 
            zipWith select (rs n g) xs1
    where
      rs k g1 = let (v, g2) = randomR (0, k) g1 in v : rs (k+1) g2
      (res, xs1) = splitAt n xs
      reservoir = listArray (0, n-1) res
      select r x = if r < n then Just (r, x) else Nothing


reservoirList :: StdGen -> Int -> [a] -> [a]
reservoirList g n xs = elems $ reservoirArray g n xs

Paul Johnson 2008-12-27 11:36:23

Thank you very much for your rewrite. I was under the impression that the reservoir was an array and hence replacement would be O(1). The "listArray" call in the definition of reservoir was supposed to do that.I definitely found that the random number generator was very slow in my implementation.

Steven Huwig 2009-01-05 15:30:45

`flip const == const id`, and the latter is shorter ;-) More seriously, you should use `Random g => g` instead of `StdGen` and return the final random state, or switch to using `Control.Monad.Random`.

ephemient 2010-02-14 19:40:35

Answer 4

A:

Groovy:

def randomSample(sampleSize, items) {
  def sample = []
  if (sampleSize == 0)
    return sample
  items = items.iterator()
  def count = 0;
  while (count++ < sampleSize && items.hasNext()) {
    sample.push(items.next())
  }
  items.eachWithIndex { item, i -> 
    def index = new Random().nextInt(i + sampleSize)
    if (index < sampleSize)
      sample[index] = item
  }
  return sample;
}

Steven Huwig 2009-02-11 14:36:07

Answer 5

A:

Python:

from random import randint

def random_sample(sample_size, items):
    if sample_size == 0:
        return []
    sample = []
    for num, item in enumerate(items):
        if num < sample_size:
            sample.append(item)
        else:
            try:
                sample[randint(0, num)] = item
            except IndexError:
                pass
    return sample

Steven Huwig 2009-02-11 14:36:43

@Beni Cherniavsky-Paskin: thanks for the edit, but this new implementation compares num to sample_size on every iteration, which is unnecessary.

Steven Huwig 2010-02-15 01:30:09

Answer 6

A:

JavaScript Rhino:

function randomSample(sampleSize, items) {
    var sample = [];
    if (sampleSize == 0)
        return sample;
    items = Iterator(items);
    for each (let [num, item] in items) {
        if (num >= sampleSize)
            break;
        sample.push(item);
    }
    for each (let [num, item] in items) {
        var index = Math.floor(Math.random() * num);
        if (index < sampleSize)
            sample[index] = item;
    }
    return sample;
}

Steven Huwig 2009-02-11 14:37:07

Answer 7

A:

Clojure:

(defn random-sample [sample-size items]
  (if (= sample-size 0) []
      (let [[sample remaining-items] (split-at sample-size items)]
        (second (reduce (fn [[num sample] item]
                          (let [index (rand-int num)]
                            (if (< index sample-size)
                              [(inc num) (assoc sample index item)]
                              [(inc num) sample])))
                        [sample-size (vec sample)]
                        remaining-items)))))

Steven Huwig 2009-02-11 14:37:40

Answer 8

A:

PLT Scheme with lots of streams:

(require srfi/27)
(require srfi/41)

(define (random-sample size input)
  (let ((first-part (stream-take size input))
        (second-part (stream-drop size input))
        (pool (make-vector size)))
    (stream-for-each (match-lambda ((list i val)
                                    (vector-set! pool i val)))
                     (stream-zip (stream-from 0) first-part))
    (stream-for-each (match-lambda ((list i val)
                                    (let ((random-index (random-integer i)))
                                      (when (< random-index size)
                                        (vector-set! pool random-index val)))))
                     (stream-zip (stream-from size) second-part))
    (vector->list pool)))

Steven Huwig 2009-02-26 03:01:43

Answer 9

+1 A:

Note that in your solution, there is a big bias for the order of the items, in the first items - since the first n items are already placed in a specific location.

A 'clean' random.sample may, for instance, return the number 2 when sampling 5 elements out of 1-100, in any of the five places. However, this solution may only return in in the second place. Although the latest items will be returned in any location, the first ones are causing the bias. So although a set-test (just checking the elements returned) will be legit, the order is not very random. Perhaps re-shuffling the sample before returning, would be better (another flag-parameter?), or it can be inserted into the beginning of the function, in a more complex way..

S.

s.kap 2010-06-20 12:26:26

Since samples are sets and sets are unordered, I didn't really care about the order. In fact, my typical use case is to resort them according to the original order in the file, since that's the only way that join(1) can be used to select records in other files.

Steven Huwig 2010-06-22 00:43:47

Besides this is really Waterman's solution, not mine. ;)

Steven Huwig 2010-06-22 00:46:37

ansaurus

tags:

views:

answers:

Rosetta Stone: reservoir random sampling algorithm

related questions