views:

179

answers:

3

to retrieve k random numbers from an array of undetermined size we use a technique called reservoir sampling. Can anybody briefly highlight how it happens with a sample code??

+1  A: 

This page contains a good explanation with pseudo-code. (The Wikipedia page I originally linked to is unclear, and the pseudo-code is incomplete.)

Stephen C
Excellent link. Specially as the Wikipedia page is useless.
Lazer
@eSKay: So did you rewrite the Wikipedia page to properly describe the algorithm? (Still in pseudocode, of course.)
Donal Fellows
@Donal Fellows: no, not yet. Let me understand this algorithm a bit better :)
Lazer
@eSKay: Forget it. I did it myself. :-) The wikipedia page still needs some work to explain *why* the algorithm works (that MSDN page is still better) but now it's at least clear what the algorithm *is*.
Donal Fellows
A: 

I wrote a blog entry about the exact thing a couple of months back, which has a C# implementation: http://gregbeech.com/blog/sampling-very-large-sequences

The best description of how it works that I've found is here: http://gregable.com/2007/10/reservoir-sampling.html

Greg Beech
+3  A: 

I actually did not realize there was a name for this, so I proved and implemented this from scratch:

import random
def random_subset( iterator, K ):
    result = []
    N = 0

    for item in iterator:
        N += 1
        if len( result ) < K:
            result.append( item )
        else:
            s = int(random.random() * N)
            if s < K:
                result[ s ] = item

    return result

From: http://propersubset.com/2010/04/choosing-random-elements.html

with a proof near the end.

Larry
@Larry: Where is the `accept it with probability s/k` part in your code? [ quote from algorithm mentioned at http://blogs.msdn.com/spt/archive/2008/02/05/reservoir-sampling.aspx ]
Lazer
By coincidence, it seems like between that article and mine, we use the same variables, but for different things. My "K" appears to be their "S", and my "N" (in code) appears to be their "K". In other words, I accept things with `K/N` probability, where N is the current count of things.
Larry
@Larry: what I meant to ask was how you were implementing probability in your code. Anyways, now I understand. Thanks!
Lazer
Not quite understanding your question, but I can explain the code a little more if you are specific! =)
Larry