views:

291

answers:

2

How to randomly pick up N numbers from a vector a with weight assigned to each number?

Let's say:

a = 1:3; % possible numbers
weight = [0.3 0.1 0.2]; % corresponding weights

In this case probability to pick up 1 should be 3 times higher than to pick up 2.

Sum of all weights can be anything.

+8  A: 
R = randsample([1 2 3], N, true, [0.3 0.1 0.2])

randsample is included in the Statistics Toolbox


Otherwise you can use some kind of roulette-wheel selection process. See this similar question (although not MATLAB specific). Here's my one-line implementation:

a = 1:3;             %# possible numbers
w = [0.3 0.1 0.2];   %# corresponding weights
N = 10;              %# how many numbers to generate

R = a( sum( bsxfun(@ge, rand(N,1), cumsum(w./sum(w))), 2) + 1 )

Explanation:

Consider the interval [0,1]. We assign for each element in the list (1:3) a sub-interval of length proportionate to the weight of each element; therefore 1 get and interval of length 0.3/(0.3+0.1+0.2), same for the others.

Now if we generate a random number with uniform distribution over [0,1], then any number in [0,1] has an equal probability of being picked, thus the sub-intervals' lengths determine the probability of the random number falling in each interval.

This matches what I'm doing above: pick a number X~U[0,1] (more like N numbers), then find which interval it falls into in a vectorized way..


You can check the results of the two techniques above by generating a large enough sequence N=1000:

>> tabulate( R )
  Value    Count   Percent
      1      511     51.10%
      2      160     16.00%
      3      329     32.90%

which more or less match the normalized weights w./sum(w) [0.5 0.16667 0.33333]

Amro
@Amro: Thanks a lot for so detailed answer.
yuk
+4  A: 

amro gives a nice answer (that I rated up), but it will be highly intensive if you wish to generate many numbers from a large set. This is because the bsxfun operation can generate a huge array, which is then summed. For example, suppose I had a set of 10000 values to sample from, all with different weights? Now, generate 1000000 numbers from that sample.

This will take some work to do, since it will generate a 10000x1000000 array internally, with 10^10 elements in it. It will be a logical array, but even so, 10 gigabytes of ram must be allocated.

A better solution is to use histc. Thus...

a = 1:3
w = [.3 .1 .2];
N = 10;

[~,R] = histc(rand(1,N),cumsum([0;w(:)./sum(w)]));
R = a(R)
R =
     1     1     1     2     2     1     3     1     1     1

However, for a large problem of the size I suggested above, it is fast.

a = 1:10000;
w = rand(1,10000);
N = 1000000;

tic
[~,R] = histc(rand(1,N),cumsum([0;w(:)./sum(w)]));
R = a(R);
toc
Elapsed time is 0.120879 seconds.

Admittedly, my version takes 2 lines to write. The indexing operation must happen on a second line since it uses the second output of histc. Also note that I've used the ability of the new matlab release, with the tilde (~) operator as the first argument of histc. This causes that first argument to be immediately dumped in the bit bucket.

woodchips
@woodcihps, thank you for the nice solution. By the way, RANDSAMPLE function suggested by Amro also uses histc approach.
yuk
@woodschips: thanks for the improvement, looking at RANDSAMPLE it is using HISTC as well
Amro
this also reminded me of an old submission of yours on FEX: http://www.mathworks.com/matlabcentral/fileexchange/11-bindex-m
Amro

related questions