I'm looking for the most efficient algorithm to randomly choose a set of n distinct integers, where all the integers are in some range [0..maxValue].

Constraints:

  • maxValue is larger than n, and possibly much larger
  • I don't care if the output list is sorted or not
  • all integers must be chosen with equal probability

My initial idea was to construct a list of the integers [0..maxValue] then extract n elements at random without replacement. But that seems quite inefficient, especially if maxValue is large.

Any better solutions?

+6  A: 

For small values of maxValue, where it is reasonable to generate an array of all the integers in memory, you can use a variation of the Fisher-Yates shuffle, performing only the first n steps.
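A minimal sketch of that truncated shuffle in Python (the function name is illustrative):

```python
import random

def partial_fisher_yates(n, max_value):
    """Run only the first n steps of a Fisher-Yates shuffle of [0..max_value]."""
    a = list(range(max_value + 1))
    for i in range(n):
        # Swap a uniformly chosen element from the unshuffled tail into slot i.
        j = random.randint(i, max_value)
        a[i], a[j] = a[j], a[i]
    return a[:n]  # the first n slots form a uniform random n-subset
```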


If n is much smaller than maxValue and you don't wish to generate the entire array then you can use this algorithm:

  1. Keep a sorted list l of the numbers picked so far, initially empty.
  2. Pick a random number x between 0 and maxValue - (number of elements in l).
  3. For each number in l, in increasing order, if it is smaller than or equal to x, add 1 to x.
  4. Add the adjusted value of x to the sorted list and repeat.

If n is very close to maxValue then you can randomly pick the elements that aren't in the result and then find the complement of that set.


Here is another algorithm that is simpler but has potentially unbounded execution time:

  1. Keep a set s of elements picked so far, initially empty.
  2. Pick a number at random between 0 and maxValue.
  3. If the number is not in s, add it to s.
  4. Go back to step 2 until s has n elements.

In practice if n is small and maxValue is large this will be good enough for most purposes.
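A sketch of this rejection approach in Python (the function name is illustrative):

```python
import random

def sample_by_rejection(n, max_value):
    """Draw until n distinct values are collected; may loop arbitrarily long."""
    s = set()
    while len(s) < n:
        s.add(random.randint(0, max_value))  # duplicates are silently ignored
    return s
```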

Mark Byers
I am not sure I understand your algorithm correctly. Assume that maxValue is 1000. If I have `{1,4}` in the list and the random function returns `3`, I add `1` to it because there is one element smaller than `3`. Now I get `{1,4,4}`. Sorry if I misunderstood.
tia
@tia: he means, `for (l in list) if (l <= x) ++x;`. So after you've incremented `x` once, because "1" is in the list, you'll increment it again, because "4" is in the list, resulting in 5.
Steve Jessop
The first approach uses space proportional to maxValue. The second one takes O(n^2) time. The third has a reasonable expected running time (O(n log n)), but it is not bounded in the worst case, as you said. See my response, which offers a solution linear in n in both time and space.
Eyal Schneider
+2  A: 

Here is one way to do it without generating the full array.

Say I want a randomly selected subset of m items from a set {x1, ..., xn} where m <= n.

Consider element x1. I add x1 to my subset with probability m/n.

  • If I do add x1 to my subset then I reduce my problem to selecting (m - 1) items from {x2, ..., xn}.
  • If I don't add x1 to my subset then I reduce my problem to selecting m items from {x2, ..., xn}.

Lather, rinse, and repeat until m = 0.

This algorithm is O(n) where n is the number of items I have to consider.

I rather imagine there is an O(m) algorithm where at each step you consider how many elements to remove from the "front" of the set of possibilities, but I haven't convinced myself of a good solution and I have to do some work now!

Rafe
I really like this idea... especially if it is possible to skip elements at the front to give the right distribution!
mikera
+1  A: 

If you are selecting M elements out of N, the strategy changes depending on whether M is of the same order as N or much less (i.e. less than about N/log N).

If they are similar in size, then you go through each item from 1 to N. You keep track of how many items you've got so far (let's call that m items picked out of the n that you've gone through), and then you take the next number with probability (M-m)/(N-n) and discard it otherwise. You then update m and n appropriately and continue. This is an O(N) algorithm with a low constant cost.
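That single-pass strategy (selection sampling) might look like this in Python, adapted to the question's range [0..maxValue] (names are illustrative):

```python
import random

def selection_sample(m, max_value):
    """Scan 0..max_value once, keeping each value with probability
    (picks still needed) / (values still remaining)."""
    total = max_value + 1
    picked = []
    for value in range(total):
        needed = m - len(picked)
        remaining = total - value
        # Keep this value with probability needed / remaining.
        if random.random() * remaining < needed:
            picked.append(value)
        if len(picked) == m:
            break
    return picked  # exactly m distinct values, already sorted
```

When needed equals remaining the keep-probability reaches 1, so the function always returns exactly m values.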

If, on the other hand, M is significantly less than N, then a resampling strategy is a good one. Here you will want to sort M so you can find them quickly (and that will cost you O(M log M) time--stick them into a tree, for example). Now you pick numbers uniformly from 1 to N and insert them into your list. If you find a collision, pick again. You will collide about M/N of the time (actually, you're integrating from 1/N to M/N), which will require you to pick again (recursively), so you'll expect to take M/(1-M/N) selections to complete the process. Thus, your cost for this algorithm is approximately O(M*(N/(N-M))*log(M)).

These are both such simple methods that you can just implement both--assuming you have access to a sorted tree--and pick the one that is appropriate given the fraction of numbers that will be picked.

(Note that picking numbers is symmetric with not picking them, so if M is almost equal to N, then you can use the resampling strategy, but pick those numbers to not include; this can be a win, even if you have to push all almost-N numbers around, if your random number generation is expensive.)

Rex Kerr
+1  A: 

My solution is the same as Mark Byers's. It takes O(n^2) time, so it's useful when n is much smaller than maxValue. Here's an implementation in Python:

import bisect
import random

def pick(n, maxValue):
    chosen = []
    for i in range(n):
        r = random.randint(0, maxValue - i)
        # Shift r past every already-chosen value <= r (chosen is sorted).
        for e in chosen:
            if e <= r:
                r += 1
            else:
                break
        bisect.insort(chosen, r)
    return chosen
Sheldon L. Cooper
+3  A: 

Here is an optimal algorithm, assuming that we are allowed to use hashmaps. It runs in O(n) time and space (and not O(maxValue) time, which is too expensive).

It is based on Floyd's random sample algorithm. See my blog post about it for details. The code is in Java:

private static Random rnd = new Random();

public static Set<Integer> randomSample(int max, int n) {
    HashSet<Integer> res = new HashSet<Integer>(n);
    int count = max + 1;
    for (int i = count - n; i < count; i++) {
        Integer item = rnd.nextInt(i + 1);
        if (res.contains(item))
            res.add(i);    // on collision, take i itself; i cannot be in res yet
        else
            res.add(item);
    }
    return res;
}
Eyal Schneider
+1 and nice article.
Mark Byers
Nice article. I do find the idea that in case of a collision I can just pick the "max" element (`i` here) counter-intuitive; care to enlighten me in "simple" words?
Matthieu M.
A: 

Linear congruential generator modulo maxValue+1. I'm sure I've written this answer before, but I can't find it...

tc.
Surely that doesn't guarantee distinct values?
mikera
With suitably chosen parameters, an LCG modulo m cycles through all values in [0, m-1]. This is one reason they're used as PRNGs (they eventually cycle through all possible output values and are therefore "uniform"). The Wikipedia page lists the necessary conditions (insert usual Wikipedia caveat): http://en.wikipedia.org/wiki/Linear_congruential_generator
tc.
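A full-period construction along the lines tc. describes can be sketched in Python. The parameters are chosen to satisfy the Hull–Dobell conditions (c coprime to m; a - 1 divisible by every prime factor of m, and by 4 if 4 divides m); the helper is illustrative. Note the caveat from the comments: with a fixed (a, c) only m distinct sequences exist (one per seed), so while each value is equally likely, not every n-subset is.

```python
import random

def lcg_distinct(n, max_value):
    """First n values of a full-period LCG modulo m = max_value + 1.

    By the Hull-Dobell theorem, x -> (a*x + c) % m visits every residue
    in [0, m-1] exactly once per cycle under the conditions above, so the
    first n outputs are guaranteed distinct.
    """
    m = max_value + 1
    # Collect the distinct prime factors of m (trial division; fine for modest m).
    factors, d, rest = set(), 2, m
    while d * d <= rest:
        while rest % d == 0:
            factors.add(d)
            rest //= d
        d += 1
    if rest > 1:
        factors.add(rest)
    k = 1
    for p in factors:
        k *= p
    if m % 4 == 0 and k % 4 != 0:
        k *= 2                      # ensure 4 divides a - 1
    a = k + 1                       # a - 1 divisible by every prime factor of m
    c = 1                           # 1 is coprime to every m
    x = random.randint(0, m - 1)    # random seed = random starting point
    out = []
    for _ in range(n):
        out.append(x)
        x = (a * x + c) % m
    return out
```

For squarefree m this degenerates to a = 1 (a simple increment), and in general the smallest valid multiplier gives highly correlated output, so a real implementation would pick a larger a of the same form.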