views:

193

answers:

3

This MSDN article proves the correctness of Reservoir Sampling algorithm as follows:


  1. Base case is trivial. For the k+1st case, the probability a given element i with position <= k is in R is s/k.

  2. The probability i is replaced is the probability k+1st element is chosen multiplied by i being chosen to be replaced, which is: s/(k+1) * 1/s = 1/(k+1), and prob that i is not replaced is k/k+1.

  3. So any given element's probability of lasting after k+1 rounds is: (chosen in k steps, and not removed in k steps) = s/k * k/(k+1), which is s/(k+1).

  4. So, when k+1 = n, any element is present with probability s/n.


about step 3:

  • What are the k+1 rounds mentioned?

  • What is chosen in k steps, and not removed in k steps?

  • Why are we only calculating this probability for elements that were already in R after the first s steps?

A: 

We're sampling s items from a stream of k items (where k is very large, so we process the stream item by item).

Processing each item from the stream is called a 'round'.

During a round, we perhaps replace one of the elements already present with the new item.

'Chosen in k steps' means that during the round where the item appeared in the stream, we chose to replace some other item with it (t.i. we didn't ignore it). 'Not removed in k steps' means that since that moment, we've not chosen to replace this item with some new item from the stream.

jkff
A: 
  • "k+1 rounds" means "after (k+1)-th item from the input sequence has been considered"
  • s/k is the probability that the given item will be in the reservoir after k steps (by induction), k/(k+1) is the probability that the item won't be replaced on the (k+1)-th step
  • we want to ensure that every input item remains in the reservoir with the same probability. so, in our calculations we are interested only in items that remain in the reservoir on each step.
Vlad
+1  A: 

Are you familiar with proof by induction? k is just the intermediate step of the algorithm, proving that the invariant is true all throughout, in this case, that the probability the k-th item will have the probability chosen s/k for all k.

Larry