views:

280

answers:

4

Lets say we have numbers from 1 to 25 and we have to choose sets of 15 numbers.

The possible sets are, if i'm right 3268760.

Of those 3268760 options, you have to generate say 100000

What would be the best way to generate 100000 unique and random of that subsets?

Is there a way, an algorithm to do that?

If not, what would be the best option to detect duplicates?

I'm planning to do this on PHP but a general solution would be enough, and any reference not to much 'academic' (more practical) would help me a lot.

+2  A: 

Do they have to be truly random? Or seemingly random?

Selection: generate a set with all 25 - "shuffle" the first 15 elements using Fisher-Yates / the Knuth shuffle, and then check if you've seen that permutation of the first 15 elements before. If so, disregard, and retry.

Duplicates: You have 25 values that are there or not - this can be trivially hashed to an integer value (if the 1st element is present, add 2^0, if the second is, add 2^1, etc. - it can be directly represented as a 25 bit number), so you can check easily if you've seen it already.

You'll get a fair bit of collisions, but if it's not a performance critical snippet, it might be doable.

James
Even when choosing the last (100K) subset, there's only a 1/32 chance of picking one that's been had before. So discard-if-equal has a less than 3% overhead above some hypothetical algorithm that takes the same time to generate a candidate, and doesn't need to discard anything. If you needed 1.6M subsets, it would be a different matter.
Steve Jessop
+2  A: 

The random number generator (RNG) of your environment will supply you random numbers that are evenly distributed in a particular range. This type of distribution is often what is needed, say if your subset simulate lottery drawings, but it is important to mention this fact in case your are modeling say the age of people found on the grounds of a middle school...

Given this RNG you can "draw" 10 (or 15, read below) numbers between 1 and 25. This may require that you multiply (and round) the random number produced by the generator, and that you ignore numbers that are above 25 (i.e. draw again), depending on the exact API associated with the RNG, but again getting a drawing in a given range is trivial. You will also need to re-draw when a number comes up again.

I suggest you get 10 numbers only, as these can be removed from the 1-25 complete sequence to produce a set of 15. In other words drawing 15 to put in is the same drawing 10 to take out...

Next you need to assert the uniqueness of the sets. Rather than storing the whole set, you can use a hash to identify each set uniquely. This should take fewer that 25 bits, so can be stored on a 32 bits integer. You then need to have an efficient storage for up to 100,000 of these values; unless you want to store this in a database.

On this question of uniqueness of 100,000 sets taken out of all the possible sets, the probability of a collision seems relatively low. Edit: Oops... I was optimistic... This probability is not so low, with about 1.5% chance of a collision starting after drawing the 50,000th, there will be quite a few collisions, enough to warrant a system to exclude them...

mjv
Save some hash time by storing a hash of the elements you're excluding!
timdev
Right, that won't affect the size of the hash but its calculation will be faster if we base it on the excluded numbers.
mjv
+4  A: 

There is a way to generate a sample of the subsets that is random, guaranteed not to have duplicates, uses O(1) storage, and can be re-generated at any time. First, write a function to generate a combination given its lexical index. Second, use a pseudorandom permutation of the first Combin(n, m) integers to step through those combinations in a random order. Simply feed the numbers 0...100000 into the permutation, use the output of the permutation as input to the combination generator, and process the resulting combination.

Theran
Thank you, I learned something today!
redtuna
Yep, very useful references. Thank you Theran. Lexicographical indexing is certainly a handy technique. I don't quite see the pseudorandom permutation part, but on second read this will come clear I'm sure.
mjv
+2  A: 

Here's a solution in PHP based on mjv's answer, which is how I was thinking about it. If you run it for a full 100k sets, you do indeed see a lot of collisions. However, I'm hard pressed to devise a system to avoid them. Instead, we just check them fairly quickly.

I'll think about better solutions ... on this laptop, I can do 10k sets in 5 seconds, 20k sets in under 20 seconds. 100k takes several minutes.

The sets are represented as (32-bit) ints.

<?PHP
    /* (c) 2009 tim - anyone who finds a use for this is very welcome to use it with no restrictions unless they're making a weapon */

    //how many sets shall we generate?
    $gNumSets = 1000;

    //keep track of collisions, just for fun.
    $gCollisions = 0;

    $starttime = time();

    /**
     * Generate and return an integer with exactly 15 of the lower 25 bits set (1) and the other 10 unset (0)
     */ 
    function genSetHash(){
      $hash = pow(2,25)-1;

      $used = array();

      for($i=0;$i<10;){

        //pick a bit to turn off
        $bit = rand(0,24);

        if (! in_array($bit,$used)){
          $hash =  ( $hash & ~pow(2,$bit) );
          $i++;  
          $used[] = $bit;  
        }
      }
      return  $hash;
    }

    //we store our solution hashes in here.  
    $solutions = array();

    //generate a bunch of solutions.
    for($i=0;$i<$gNumSets;){
      $hash = genSetHash(); 

      //ensure no collisions
      if (! in_array($hash,$solutions)){
        $solutions[] = $hash;
        //brag a little.
        echo("Generated $i random sets in " . (time()-$starttime) . " seconds.\n");
        $i++;
      }else { 
        //there was a collision. There will generally be more the longer the process runs.
        echo "thud.\n"; 
        $gCollisions++;
      }
    }

    // okay, we're done with the hard work.  $solutions contains a bunch of
    // unique, random, ints in the right range.  Everything from here on out
    // is just output.

    //takes an integer with 25 significant digits, and returns an array of 15 numbers between 1 and 25
    function hash2set($hash){
      $set = array();
      for($i=0;$i<24;$i++){  
        if ($hash & pow(2,$i)){
          $set[] = $i+1;
        }
      }
      return $set;
    }

    //pretty-print our sets.
    function formatSet($set){
      return "[ " . implode(',',$set) . ']';
    }

    //if we wanted to print them, 
    foreach($solutions as $hash){
      echo formatSet(hash2set($hash)) . "\n";
    }

    echo("Generated $gNumSets unique random sets in " . (time()-$starttime) . " seconds.\n");

    echo "\n\nDone.  $gCollisions collisions.\n";

I think it's all correct, but it's late, and I've been enjoying several very nice bottles of beer.

timdev
+1 for the functional PHP script. The algorithm isn't mathematically as elegant as the one suggested by Theran but practical all the same, and it too can re-generate subsets from their ID/hash. I don't think that the overhead associated with collisions is a factor (roughly less than 2% overall, as pointed out by onebyone). Not bad for a CWI (Coding While...) effort.
mjv
All the answers were excellent! Giving tim the accept for code. Too bad one can not accept more then one aswer
Cesar