If you are selecting M elements out of N, the strategy changes depending on whether M is of the same order as N or much less (i.e. less than about N/log N).
If they are similar in size, then you go through each item from 1 to N. You keep track of how many items you've picked so far (say you have picked m items out of the n you've gone through), and you take the next number with probability (M-m)/(N-n), discarding it otherwise. You then update m and n appropriately and continue. This is an O(N) algorithm with a low constant cost.
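Here is a minimal sketch of that single pass in Python (the function name and the rng parameter are mine; this is the classic "selection sampling" idea):

```python
import random

def select_in_order(n, m, rng=random):
    """Scan 1..n once, taking each item with probability
    (picks remaining) / (items remaining) -- the (M-m)/(N-n) rule."""
    picked = []
    to_pick = m
    for i in range(1, n + 1):
        remaining = n - i + 1  # items left to examine, including i
        if rng.random() * remaining < to_pick:
            picked.append(i)
            to_pick -= 1
            if to_pick == 0:  # nothing left to pick; stop early
                break
    return picked  # comes out already sorted, since we scan in order
```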
If, on the other hand, M is significantly less than N, then a resampling strategy is a good one. Here you will want to keep your M picks sorted so you can check them quickly (and that will cost you O(M log M) time--stick them into a tree, for example). Now you pick numbers uniformly from 1 to N and insert them into your list. If you find a collision, pick again. You will collide about M/N of the time (actually, you're integrating from 1/N to M/N), which will require you to pick again (recursively), so you'll expect to take about M/(1-M/N) selections to complete the process. Thus, the cost of this algorithm is approximately O(M*(N/(N-M))*log(M)).
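A sketch of the resampling approach, with one substitution: I use Python's built-in set rather than the sorted tree suggested above, which makes each collision check expected O(1) instead of O(log M) but leaves the structure of the algorithm unchanged:

```python
import random

def select_by_resampling(n, m, rng=random):
    """Pick m distinct numbers from 1..n by drawing uniformly and
    redrawing on collision. Good when m is much less than n."""
    picked = set()  # hash set stands in for the sorted tree
    while len(picked) < m:
        # a colliding draw leaves the set unchanged, so the loop
        # simply picks again, as described above
        picked.add(rng.randint(1, n))
    return sorted(picked)
```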
These are both such simple methods that you can just implement both--assuming you have access to a sorted tree--and pick the one that is appropriate given the fraction of numbers that will be picked.
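A dispatcher along those lines might look like this; it reuses the two sketches above, and the n/log n cutoff is the rough threshold from the opening paragraph, not a tuned constant:

```python
import math
import random

def sample(n, m, rng=random):
    """Choose the strategy based on the fraction of numbers picked."""
    if n > 1 and m < n / math.log(n):
        return select_by_resampling(n, m, rng)
    return select_in_order(n, m, rng)
```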
(Note that picking numbers is symmetric with not picking them, so if M is almost equal to N, you can use the resampling strategy but pick the numbers to exclude instead. This can be a win, even though you then have to push almost N numbers around, if your random number generation is expensive.)
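A sketch of that complement trick, reusing the hypothetical select_by_resampling from above:

```python
import random

def select_complement(n, m, rng=random):
    """When m is close to n, resample the n - m numbers to *exclude*,
    then return everything else (the symmetry noted above)."""
    excluded = set(select_by_resampling(n, n - m, rng))
    return [i for i in range(1, n + 1) if i not in excluded]
```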