tags:

views:

138

answers:

2

Page 120 of Programming Pearls 1st edition presents this algorithm for selecting M equally probable random elements out of a population of N integers.

InitToEmpty
Size := 0
While Size < M do
  T := RandInt(1,N)
  if not Member(T)    
     Insert(T)
     Size := Size + 1

It is stated that the expected number of Member tests is less than 2M, as long as M < N/2.

I'd like to know how to prove it, but my algorithm analysis background is failing me.

I understand that the closer M is to N, the longer the program will take, because the result set will have more elements and the likelihood of RandInt selecting an existing one will increase proportionally.

Can you help me figuring out this proof?

+4  A: 

I am not a math wizard, but I will give it a rough shot. This is NOT guaranteed to be right though.

For each additional member of M, you pick a number, see if it's there, and if is add it. Otherwise, you try again. Trying something until you're successful is called a geometric probability distribution.

http://en.wikipedia.org/wiki/Geometric_distribution

So you are running M geometric trials. Each trial has expected value 1/p, so will take expected 1/p tries to get a number not already in M. p is N minus the number of numbers we've already added from M divided by N (i.e. how many unpicked items / total items). So for the fourth number, p = (N -3) / N, which is the probability of picking an unused number, so the expected number of picks for the third number is N / N-3 .

The expected value of the run time is all of these added together. So something like

E(run time) = N/N + N/(N -1) + N/(N -2 ) ... + N/ (N-M)

Now if M < N/2, then the last element in that summation is bounded above by 2. ((N/N/2) == 2)). It's also obviously the largest element in the whole summation. So if the biggest element is two picks, and there are M elements being summed, the EV of the whole run time is bounded above by 2M.

Ask me if any of this is unclear. Correct me if any of this is wrong :)

FranticPedantic
+1: I deleted my answer. That sum is approximately N*ln(N/(N-M+1)), so for M=N, we expect O(NlogN) tests.
Moron
Perfect. Now it makes total sense for me.
mpm
I would add parentheses: E(run time) = N/N + N/(N-1) + N/(N -2) ... + N/(N-M)
Eyal Schneider
+1  A: 

Say we have chosen K elements out of N. Then our next try has probability (N-K)/N of succeeding, so the number of tries that it takes to find the K + 1 st element is geometrically distributed with mean N/(N-K).

So if 2M < N we expect it to take less than two tries to get each element.

deinst
+1 for this too. Quick back of the envelope explanation :-)
Moron