If you are selecting M elements out of N, the strategy changes depending on whether M is of the same order as N or much less (i.e. less than about N/log N).
If they are similar in size, then you go through each item from 1 to N. You keep track of how many items you've picked so far (say you have picked m items out of the n you've gone through), and you take the next number with probability (M-m)/(N-n), discarding it otherwise. You then update m and n appropriately and continue. This is an O(N) algorithm with a low constant cost.
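Here is a minimal sketch of that single pass in Python (the function name and the rng parameter are mine; this is the classic "selection sampling" idea):

```python
import random

def select_in_order(n, m, rng=random):
    """Scan 1..n once, taking each item with probability
    (picks remaining) / (items remaining) -- the (M-m)/(N-n) rule."""
    picked = []
    to_pick = m
    for i in range(1, n + 1):
        remaining = n - i + 1  # items left to examine, including i
        if rng.random() * remaining < to_pick:
            picked.append(i)
            to_pick -= 1
            if to_pick == 0:  # nothing left to pick; stop early
                break
    return picked  # comes out already sorted, since we scan in order
```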
If, on the other hand, M is significantly less than N, then a resampling strategy is a good one. Here you will want to keep your M picks sorted so you can check them quickly (and that will cost you O(M log M) time--stick them into a tree, for example). Now you pick numbers uniformly from 1 to N and insert them into your list. If you find a collision, pick again. You will collide about M/N of the time (actually, you're integrating from 1/N to M/N), which will require you to pick again (recursively), so you'll expect to take about M/(1-M/N) selections to complete the process. Thus, the cost of this algorithm is approximately O(M*(N/(N-M))*log(M)).
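A sketch of the resampling approach, with one substitution: I use Python's built-in set rather than the sorted tree suggested above, which makes each collision check expected O(1) instead of O(log M) but leaves the structure of the algorithm unchanged:

```python
import random

def select_by_resampling(n, m, rng=random):
    """Pick m distinct numbers from 1..n by drawing uniformly and
    redrawing on collision. Good when m is much less than n."""
    picked = set()  # hash set stands in for the sorted tree
    while len(picked) < m:
        # a colliding draw leaves the set unchanged, so the loop
        # simply picks again, as described above
        picked.add(rng.randint(1, n))
    return sorted(picked)
```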
These are both such simple methods that you can just implement both--assuming you have access to a sorted tree--and pick the one that is appropriate given the fraction of numbers that will be picked.
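A dispatcher along those lines might look like this; it reuses the two sketches above, and the n/log n cutoff is the rough threshold from the opening paragraph, not a tuned constant:

```python
import math
import random

def sample(n, m, rng=random):
    """Choose the strategy based on the fraction of numbers picked."""
    if n > 1 and m < n / math.log(n):
        return select_by_resampling(n, m, rng)
    return select_in_order(n, m, rng)
```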
(Note that picking numbers is symmetric with not picking them, so if M is almost equal to N, you can use the resampling strategy but pick the numbers to exclude instead. This can be a win, even though you then have to push almost N numbers around, if your random number generation is expensive.)
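A sketch of that complement trick, reusing the hypothetical select_by_resampling from above:

```python
import random

def select_complement(n, m, rng=random):
    """When m is close to n, resample the n - m numbers to *exclude*,
    then return everything else (the symmetry noted above)."""
    excluded = set(select_by_resampling(n, n - m, rng))
    return [i for i in range(1, n + 1) if i not in excluded]
```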