to retrieve k random numbers from an array of undetermined size we use a technique called reservoir sampling. Can anybody briefly highlight how it happens with a sample code??
+1
A:
This page contains a good explanation with pseudo-code. (The Wikipedia page I originally linked to is unclear, and the pseudo-code is incomplete.)
Stephen C
2010-04-10 07:57:09
Excellent link. Specially as the Wikipedia page is useless.
Lazer
2010-04-11 06:08:20
@eSKay: So did you rewrite the Wikipedia page to properly describe the algorithm? (Still in pseudocode, of course.)
Donal Fellows
2010-04-11 06:18:39
@Donal Fellows: no, not yet. Let me understand this algorithm a bit better :)
Lazer
2010-04-11 06:42:31
@eSKay: Forget it. I did it myself. :-) The wikipedia page still needs some work to explain *why* the algorithm works (that MSDN page is still better) but now it's at least clear what the algorithm *is*.
Donal Fellows
2010-04-11 06:55:02
A:
I wrote a blog entry about the exact thing a couple of months back, which has a C# implementation: http://gregbeech.com/blog/sampling-very-large-sequences
The best description of how it works that I've found is here: http://gregable.com/2007/10/reservoir-sampling.html
Greg Beech
2010-04-10 08:13:02
+3
A:
I actually did not realize there was a name for this, so I proved and implemented this from scratch:
import random
def random_subset( iterator, K ):
result = []
N = 0
for item in iterator:
N += 1
if len( result ) < K:
result.append( item )
else:
s = int(random.random() * N)
if s < K:
result[ s ] = item
return result
From: http://propersubset.com/2010/04/choosing-random-elements.html
with a proof near the end.
Larry
2010-04-10 08:59:03
@Larry: Where is the `accept it with probability s/k` part in your code? [ quote from algorithm mentioned at http://blogs.msdn.com/spt/archive/2008/02/05/reservoir-sampling.aspx ]
Lazer
2010-04-11 10:26:06
By coincidence, it seems like between that article and mine, we use the same variables, but for different things. My "K" appears to be their "S", and my "N" (in code) appears to be their "K". In other words, I accept things with `K/N` probability, where N is the current count of things.
Larry
2010-04-11 16:14:21
@Larry: what I meant to ask was how you were implementing probability in your code. Anyways, now I understand. Thanks!
Lazer
2010-04-12 03:11:57
Not quite understanding your question, but I can explain the code a little more if you are specific! =)
Larry
2010-04-12 04:31:36