views:

1944

answers:

12

The question gives all necessary data: what is an efficient algorithm to generate a sequence of K non-repeating integers within a given interval. The trivial algorithm (generating random numbers and, before adding them to the sequence, looking them up to see if they were already there) is very expensive if K is large and near enough to N.

The algorithm provided here seems more complicated than necessary, and requires some implementation. I've just found another algorithm that seems to do the job fine, as long as you know all the relevant parameters, in a single pass.

+1  A: 

Speed up the trivial algorithm by storing the K numbers in a hashing store. Knowing K before you start takes away all the inefficiency of inserting into a hash map, and you still get the benefit of fast look-up.

Bill the Lizard
Yeah, that the way I did it when I needed 10M nonrepeating random numbers for a lottery
axk
Not too memory-efficient - need a K-sized auxiliary structure. In time, you need K insertions and N removals. The algorithm I found needs only (at most) K random draws.
tucuxi
You don't need an auxiliary structure at all. Just make the map your only structure. You'll always need K insertions to store K items. Why do you need N removals?
Bill the Lizard
Inserting into and checking the K-sized data structure isn't where problem with the trivial algo is, it's that as K -> N, your RNG will have a very high probability of generating a number you've already seen before when filling the end of the sequence. You need a hash map, but thats auxilliary.
Greg Rogers
+1  A: 

The following code (in C, unknown origin) seems to solve the problem extremely well:

 /* generate N sorted, non-duplicate integers in [0, max[ */
 int *generate(int n, int max) {
    int i, m, a;    
    int *g = (int *)calloc(n, sizeof(int));
    if ( ! g) return 0;

    m = 0;
    for (i=0; i<max; i++) {
        a = random_in_between(0, max - i);
        if (a < n - m) {
            g[m] = i;
            m ++;
        }
    }
    return g;
 }

Does anyone know where I can find more gems like this one?

tucuxi
Programming Pearls by Jon Bentley (the pun on "gems" was intentional). :)
Bill the Lizard
What does "random_in_between" stands for?
Luis Filipe
This algorithm is terribly inefficient for small sample chosen from a large set. Picking 5 integers from a million takes one million calls to rand() instead of 5.
Rafał Dowgird
Thanks for the booktitle - I couldn't think of any other way to find it. Luis, random_in_between is for 'number between lo and hi, not including hi'. Praptak, perfectly true. Should have specified 'memory efficiency' versus 'time efficiency'. At least it's guaranteed to finish in bounded time...
tucuxi
+2  A: 

Generate an array 0...N filled a[i] = i.

Then shuffle the first K items.

Shuffling:

  • Start J = N-1
  • Pick a random number0...J (say R)
  • swap a[R] with a[J]
  • subtract 1 from J and repeat.

UPDATE: To respond to the comments: To produce, for example, two non-repeating int between 0...10 (K=2, N=10)

  • We start a sorted, non-repeating array: {0,1,2,3,4,5,6,7,8,9}
  • we then shuffle the first (actually last) K elements:

    • First, set J = 9.
    • R = random number 0..9, say R=6
    • Swap A[R] with A[J]. Array is now {0,1,2,3,4,9,6,7,8,5}
    • Subtract 1 from J and repeat (J=8)
    • R = random number 0..8, say R=3
    • Swap A[R] with A[J]. Array is now {0,1,8,3,4,9,6,7,2,5}

We can stop here as we've done K swap, and just take the last K elements of the array.

James Curran
Your approach is fine for generating permutations in [0, N[, but I want numbers in the range [0, K[. For instance, if N=2 and K=10, {5, 9} is a valid output sequence.
tucuxi
Then generate 0 .. K, and then remove numbers randomly until you have N numbers.
Dark Shikari
A: 

This is Perl Code. Grep is a filter, and as always I didn't test this code.

@list = grep ($_ % I) == 0, (0..N);
  • I = interval
  • N = Upper Bound

Only get numbers that match your interval via the modulus operator.

@list = grep ($_ % 3) == 0, (0..30);

will return 0, 3, 6, ... 30

This is pseudo Perl code. You may need to tweak it to get it to compile.

J.J.
+3  A: 

The random module from Python library makes it extremely easy and effective:

from random import sample
print sample(xrange(N), K)

sample function returns a list of K unique elements chosen from the given sequence.
xrange is a "list emulator", i.e. it behaves like a list of consecutive numbers without creating it in memory, which makes it super-fast for tasks like this one.

DzinX
The python implementation is quite nice (see http://svn.python.org/view/python/trunk/Lib/random.py?view=markup, search for "sample"). They distinguish two cases, one for large K (K near N) and one for small K. For large K, they selectively copy elements over. For small K, they draw elements randomly, avoiding repetitions using a set.
tucuxi
A: 

The Reservoir Sampling version is pretty simple:

my $N = 20;
my $k;
my @r;

while(<>) {
  if(++$k <= $N) {
    push @r, $_;
  } elsif(rand(1) <= ($N/$k)) {
    $r[rand(@r)] = $_;
  }
}

print @r;

That's $N randomly selected rows from STDIN. Replace the <>/$_ stuff with something else if you're not using rows from a file, but it's a pretty straightforward algorithm.

Michael Cramer
A: 

Here's a way to do it in O(N) without extra storage. I'm pretty sure this is not a purely random distribution, but it's probably close enough for many uses.

/* generate N sorted, non-duplicate integers in [0, max[  in O(N))*/
 int *generate(int n, int max) {
    float step,a,v=0;
    int i;    
    int *g = (int *)calloc(n, sizeof(int));
    if ( ! g) return 0;

    for (i=0; i<n; i++) {
        step = (max-v)/(float)(n-i);
        v+ = floating_pt_random_in_between(0.0, step*2.0);
        if ((int)v == g[i-1]){
          v=(int)v+1;             //avoid collisions
        }
        g[i]=v;
    }
    while (g[i]>max) {
      g[i]=max;                   //fix up overflow
      max=g[i--]-1;
    }
    return g;
 }
AShelly
A: 

My solution is C++ oriented, but I'm sure it could be translated to other languages since it's pretty simple.

  • First, generate a linked list with K elements, going from 0 to K
  • Then as long as the list isn't empty, generate a random number between 0 and the size of the vector
  • Take that element, push it into another vector, and remove it from the original list

This solution only involves two loop iterations, and no hash table lookups or anything of the sort. So in actual code:

// Assume K is the highest number in the list
std::vector<int> sorted_list;
std::vector<int> random_list;

for(int i = 0; i < K; ++i) {
    sorted_list.push_back(i);
}

// Loop to K - 1 elements, as this will cause problems when trying to erase
// the first element
while(!sorted_list.size() > 1) {
    int rand_index = rand() % sorted_list.size();
    random_list.push_back(sorted_list.at(rand_index));
    sorted_list.erase(sorted_list.begin() + rand_index);
}                 

// Finally push back the last remaining element to the random list
// The if() statement here is just a sanity check, in case K == 0
if(!sorted_list.empty()) {
    random_list.push_back(sorted_list.at(0));
}
Nik Reiman
+3  A: 

It is actually possible to do this in space proportional to the number of elements selected, rather than the size of the set you're selecting from, regardless of what proportion of the total set you're selecting. You do this by generating a random permutation, then selecting from it like this:

Pick a block cipher, such as TEA or XTEA. Use XOR folding to reduce the block size to the smallest power of two larger than the set you're selecting from. Use the random seed as the key to the cipher. To generate an element n in the permutation, encrypt n with the cipher. If the output number is not in your set, encrypt that. Repeat until the number is inside the set. On average you will have to do less than two encryptions per generated number. This has the added benefit that if your seed is cryptographically secure, so is your entire permutation.

I wrote about this in much more detail here.

Nick Johnson
j_random_hacker
I don't have the theoretical basis to hand, but no, it doesn't destroy the 1-to-1 mapping properties of the block cipher. Xor folding is taken from the TEA cipher - perhaps check references on that for more detail.
Nick Johnson
@j_random_hacker: Of course, you're right. But it's nevertheless possible to come up with a pseudo random permutation using a custom Feistel cipher using the some cryptographic hash function as function F.
sellibitze
see here: http://stackoverflow.com/questions/196017/unique-random-numbers-in-o1/3094476#3094476
sellibitze
+2  A: 

In The Art of Computer Programming, Volume 2: Seminumerical Algorithms, Third Edition, Knuth describes the following selection sampling algorithm:

Algorithm S (Selection sampling technique). To select n records at random from a set of N, where 0 < n ≤ N.

S1. [Initialize.] Set t ← 0, m ← 0. (During this algorithm, m represents the number of records selected so far, and t is the total number of input records that we have dealt with.)

S2. [Generate U.] Generate a random number U, uniformly distributed between zero and one.

S3. [Test.] If (N – t)U ≥ n – m, go to step S5.

S4. [Select.] Select the next record for the sample, and increase m and t by 1. If m < n, go to step S2; otherwise the sample is complete and the algorithm terminates.

S5. [Skip.] Skip the next record (do not include it in the sample), increase t by 1, and go back to step S2.

An implementation may be easier to follow than the description. Here is a Common Lisp implementation that select n random members from a list:

(defun sample-list (n list &optional (length (length list)) result)
  (cond ((= length 0) result)
        ((< (* length (random 1.0)) n)
         (sample-list (1- n) (cdr list) (1- length)
                      (cons (car list) result)))
        (t (sample-list n (cdr list) (1- length) result))))

And here is an implementation that does not use recursion, and which works with all kinds of sequences:

(defun sample (n sequence)
  (let ((length (length sequence))
        (result (subseq sequence 0 n)))
    (loop
       with m = 0
       for i from 0 and u = (random 1.0)
       do (when (< (* (- length i) u) 
                   (- n m))
            (setf (elt result m) (elt sequence i))
            (incf m))
       until (= m n))
    result))
Vebjorn Ljosa
Thanks for the authoritative answer. I have the same requirement, and this is the algo I'm planning to implement. Thanks again.
sundar
A: 

If the list is sorted, for example, if you want to extract K elements out of N, but you do not care about their relative order, an efficient algorithm is proposed in the paper An Efficient Algorithm for Sequential Random Sampling (Jeffrey Scott Vitter, ACM Transactions on Mathematical Software, Vol. 13, No. 1, March 1987, Pages 56-67.).

edited to add the code in c++ using boost. I've just typed it and there might be many errors. The random numbers come from the boost library, with a stupid seed, so don't do anything serious with this.

/* Sampling according to [Vitter87].
 * 
 * Bibliography
 * [Vitter 87]
 *   Jeffrey Scott Vitter, 
 *   An Efficient Algorithm for Sequential Random Sampling
 *   ACM Transactions on MAthematical Software, 13 (1), 58 (1987).
 */

#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <string>
#include <iostream>

#include <iomanip>

#include <boost/random/linear_congruential.hpp>
#include <boost/random/variate_generator.hpp>
#include <boost/random/uniform_real.hpp>

using namespace std;

// This is a typedef for a random number generator.
// Try boost::mt19937 or boost::ecuyer1988 instead of boost::minstd_rand
typedef boost::minstd_rand base_generator_type;

    // Define a random number generator and initialize it with a reproducible
    // seed.
    // (The seed is unsigned, otherwise the wrong overload may be selected
    // when using mt19937 as the base_generator_type.)
    base_generator_type generator(0xBB84u);
    //TODO : change the seed above !
    // Defines the suitable uniform ditribution.
    boost::uniform_real<> uni_dist(0,1);
    boost::variate_generator<base_generator_type&, boost::uniform_real<> > uni(generator, uni_dist);



void SequentialSamplesMethodA(int K, int N) 
// Outputs K sorted random integers out of 0..N, taken according to 
// [Vitter87], method A.
    {
    int top=N-K, S, curr=0, currsample=-1;
    double Nreal=N, quot=1., V;

    while (K>=2)
        {
        V=uni();
        S=0;
        quot=top/Nreal;
        while (quot > V)
            {
            S++; top--; Nreal--;
            quot *= top/Nreal;
            }
        currsample+=1+S;
        cout << curr << " : " << currsample << "\n";
        Nreal--; K--;curr++;
        }
    // special case K=1 to avoid overflow
    S=floor(round(Nreal)*uni());
    currsample+=1+S;
    cout << curr << " : " << currsample << "\n";
    }

void SequentialSamplesMethodD(int K, int N)
// Outputs K sorted random integers out of 0..N, taken according to 
// [Vitter87], method D. 
    {
    const int negalphainv=-13; //between -20 and -7 according to [Vitter87]
    //optimized for an implementation in 1987 !!!
    int curr=0, currsample=0;
    int threshold=-negalphainv*K;
    double Kreal=K, Kinv=1./Kreal, Nreal=N;
    double Vprime=exp(log(uni())*Kinv);
    int qu1=N+1-K; double qu1real=qu1;
    double Kmin1inv, X, U, negSreal, y1, y2, top, bottom;
    int S, limit;
    while ((K>1)&&(threshold<N))
        {
        Kmin1inv=1./(Kreal-1.);
        while(1)
            {//Step D2: generate X and U
            while(1)
                {
                X=Nreal*(1-Vprime);
                S=floor(X);
                if (S<qu1) {break;}
                Vprime=exp(log(uni())*Kinv);
                }
            U=uni();
            negSreal=-S;
            //step D3: Accept ?
            y1=exp(log(U*Nreal/qu1real)*Kmin1inv);
            Vprime=y1*(1. - X/Nreal)*(qu1real/(negSreal+qu1real));
            if (Vprime <=1.) {break;} //Accept ! Test [Vitter87](2.8) is true
            //step D4 Accept ?
            y2=0; top=Nreal-1.;
            if (K-1 > S)
                {bottom=Nreal-Kreal; limit=N-S;}
            else {bottom=Nreal+negSreal-1.; limit=qu1;}
            for(int t=N-1;t>=limit;t--)
                {y2*=top/bottom;top--; bottom--;}
            if (Nreal/(Nreal-X)>=y1*exp(log(y2)*Kmin1inv))
                {//Accept !
                Vprime=exp(log(uni())*Kmin1inv);
                break;
                }
            Vprime=exp(log(uni())*Kmin1inv);
            }
        // Step D5: Select the (S+1)th record
        currsample+=1+S;
        cout << curr << " : " << currsample << "\n";
        curr++;
        N-=S+1; Nreal+=negSreal-1.;
        K-=1; Kreal-=1; Kinv=Kmin1inv;
        qu1-=S; qu1real+=negSreal;
        threshold+=negalphainv;
        }
    if (K>1) {SequentialSamplesMethodA(K, N);}
    else {
        S=floor(N*Vprime);
        currsample+=1+S;
        cout << curr << " : " << currsample << "\n";
        }
    }


int main(void)
    {
    int Ntest=10000000, Ktest=Ntest/100;
    SequentialSamplesMethodD(Ktest,Ntest);
    return 0;
    }

$ time ./sampling|tail

gives the following ouptut on my laptop

99990 : 9998882
99991 : 9998885
99992 : 9999021
99993 : 9999058
99994 : 9999339
99995 : 9999359
99996 : 9999411
99997 : 9999427
99998 : 9999584
99999 : 9999745

real    0m0.075s
user    0m0.060s
sys 0m0.000s
Frédéric Grosshans
A: 

Step 1: Generate your list of integers.
Step 2: Perform Knuth Shuffle.

Note that you don't need to shuffle the entire list, since the Knuth Shuffle algorithm allows you to apply only n shuffles, where n is the number of elements to return. Generating the list will still take time proportional to the size of the list, but you can reuse your existing list for any future shuffling needs (assuming the size stays the same) with no need to preshuffle the partially shuffled list before restarting the shuffling algorithm.

The basic algorithm for Knuth Shuffle is that you start with a list of integers. Then, you swap the first integer with any number in the list and return the current (new) first integer. Then, you swap the second integer with any number in the list (except the first) and return the current (new) second integer. Then...etc...

This is an absurdly simple algorithm, but be careful that you include the current item in the list when performing the swap or you will break the algorithm.

Brian