I'm looking for either an algorithm or a suggestion to improve my code for generating a list of random numbers whose sum equals some arbitrary number. With my code below, the result will always be biased, as the first numbers will tend to be higher.

Is there a way to make the number selection more efficient?

#!/usr/bin/python
'''
  Generate a list of 'numbs' positive random numbers whose sum = 'limit_sum'
'''

import random


def gen_list(numbs, limit_sum):
  my_sum = []
  for index in range(0, numbs):
    if index == numbs - 1:
      my_sum.append(limit_sum - sum(my_sum))
    else:
      my_sum.append(random.uniform(0, limit_sum - sum(my_sum)))

  return my_sum

#test
import pprint
pprint.pprint(gen_list(5, 20))
pprint.pprint(gen_list(10, 200))
pprint.pprint(gen_list(0, 30))
pprint.pprint(gen_list(1, 10))

The output:

[0.10845093828525609,
 16.324799712999706,
 0.08200162072303821,
 3.4534885160590041,
 0.031259211932997744]

[133.19609626532952,
 47.464880208741029,
 8.556082341110228,
 5.7817325913462323,
 4.6342577008233716,
 0.22532341156764768,
 0.0027495225618908918,
 0.064738336208217895,
 0.028888697891734455,
 0.045250924420116689]

[]

[10]
+13  A: 

Why not just generate the right number of uniformly distributed random numbers, tot them up, and scale?

EDIT: To be a bit clearer: you want N numbers which sum to S? So generate N uniformly distributed random numbers on the interval [0,1), or whatever your RNG produces. Add them up; they will total some value s, whereas you want them to total S, so multiply each number by S/s. Now the numbers are uniformly randomly distributed on [0, S/s), I think.
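A minimal sketch of this generate-and-scale idea in Python (the function name `gen_scaled` is mine, not from the question):

```python
import random

def gen_scaled(n, total):
    # Draw n uniform numbers, then rescale them so they sum to total.
    xs = [random.uniform(0, 1) for _ in range(n)]
    s = sum(xs)
    return [x * total / s for x in xs]

values = gen_scaled(5, 20)
# sum(values) equals 20 up to floating-point rounding
```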

High Performance Mark
+1 Nice approach. Not what I had in mind :)
jensgram
+1 very nice idea
oezi
I'm not sure I follow 100% -- can you be a bit more descriptive, please?
dassouki
-1. Scaling messes with the uniformly distributed-ness.
Jason S
"Scaling messes with the uniformly distributed-ness". Really? Why?
smirkingman
Because the totals are a function of the individual random numbers. Let's say there are 10 numbers and the desired total is 100, and you generate 10 numbers uniformly distributed from 0.0 to 1.0. The expected value of the sum is 5, std dev = sqrt(10/12), so most of the time the sum will be between 2 and 8, and therefore the scaling factor will usually be between 12.5 and 50. Therefore on very rare occasions you will get a scaled output number between 50 and 100: you need a small base sum with one of the numbers much larger than the rest.
Jason S
"most of the time the sum will be between 2 and 8": that's 3.3*sigma from the mean, or approx 0.1% of the time it will be outside that range.
Jason S
"Now the numbers are uniformly randomly distributed on [0,S/s)" -- NO!!!! By the time you know s, they are no longer random variables with distributions, they are specific numbers.
Jason S
@Jason S -- I feel bad gaining so much rep from what is a dodgy answer. Any more valid suggestions ?
High Performance Mark
@Mark: Well, I feel bad about sounding so negative about it; you made a legitimate attempt at a clear and simple answer. Unfortunately it has some statistical flaws.
Jason S
@Jason S: it's the end of the working day here, I'm going for a beer to help me get over the bad feelings I have. Suggest you do the same.
High Performance Mark
@Mark: good suggestion :-)
Jason S
A: 

You could keep a running total rather than having to call sum(my_sum) repeatedly.
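For instance, the question's function could track the remaining budget in a variable (a sketch; the behavior, bias included, is unchanged):

```python
import random

def gen_list(numbs, limit_sum):
    # Same logic as the question's code, but keeping a running total
    # instead of calling sum(my_sum) on every iteration.
    my_sum = []
    remaining = limit_sum
    for index in range(numbs):
        if index == numbs - 1:
            value = remaining       # last slot takes whatever is left
        else:
            value = random.uniform(0, remaining)
        my_sum.append(value)
        remaining -= value
    return my_sum
```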

ar
+9  A: 

Here's how I would do it:

  1. Generate n-1 random numbers, all in the range [0,max]
  2. Sort those numbers
  3. For each pair of adjacent numbers in the sorted list, form the interval between them and compute its length. The first interval starts at 0 and ends at the first number in the list, and the last interval starts at the last number and ends at max.

Now, the lengths of those intervals will always sum up to max, since they simply represent segments inside [0,max].

Code (in Python):

#! /usr/bin/env python
import random

def random_numbers(n,sum_to):
    # n-1 random cut points, bracketed by the endpoints 0 and sum_to
    values=[0]+[random.randint(0,sum_to) for i in xrange(n-1)]+[sum_to]
    values.sort()
    # gaps between consecutive cut points; they always total sum_to
    intervals=[values[i+1]-values[i] for i in xrange(len(values)-1)]
    return intervals

if __name__=='__main__':
    print random_numbers(5,100)
MAK
I like it - I never thought of going about it that way.
neil
-1. This algorithm suffers from long tails as well. (no disrespect meant, it's a valiant effort)
Jason S
+1  A: 

The following is quite simple, and returns uniform results:

def gen_list(numbs, limit_sum):
    limits = sorted([random.uniform(0, limit_sum) for _ in xrange(numbs-1)])
    limits = [0] + limits + [limit_sum]
    return [x1-x0 for (x0, x1) in zip(limits[:-1], limits[1:])]

The idea is simply that if you need, say, 5 numbers between 0 and 20, you can simply put 4 "limits" between 0 and 20, and you get a partition of the (0, 20) interval. The random numbers that you want are simply the lengths of the 5 intervals in the sorted list [0, random1, random2, random3, random4, 20].

PS: oops! looks like it's the same idea as MAK's response, albeit coded without using indexes!

EOL
+6  A: 

If you are looking for normally-distributed numbers with as little correlation as possible, and need to be rigorous* about this, I would suggest you take the following mathematical approach and translate into code.

(*rigorous: the problem with other approaches is that you can get "long tails" in your distributions -- in other words, it is rare but possible to have outliers that are very different from your expected output)

  • Generate N-1 independent and identically distributed (IID) gaussian random variables v0, v1, v2, ... vN-1 to match the N-1 degrees of freedom of your problem.
  • Create a column vector V = [0, v0, v1, v2, ..., vN-1]^T
  • Use a fixed weighting matrix W, where W consists of an orthonormal matrix** whose top row is [1 1 1 1 1 1 1 ... 1] / sqrt(N).
  • Your output vector is the product WV + SU/N where S is the desired sum and U is the column vector of 1's. In other words, the i'th output variable = the dot product of (row #i of matrix W) and column vector V, added to S/N.

The standard deviation of each output variable will be (I believe, can't verify right now) sqrt(N/(N-1)) times the standard deviation of the input random variables.

**orthonormal matrix: this is the hard part, I put in a question at math.stackexchange.com and there's a simple matrix W that works, and can be defined algorithmically with only 3 distinct values, so that you don't actually have to construct the matrix.

W, the Householder reflection of v-w where v = [sqrt(N) 0 0 0 ... 0] and w = [1 1 1 1 ... 1], can be defined by:

W(1,i) = W(i,1) = 1/sqrt(N)
W(i,i) = 1 - K   for i >= 2 
W(i,j) = -K      for i,j >= 2, i != j
K = 1/sqrt(N)/(sqrt(N)-1)
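The recipe above can be sketched in plain Python (the function name, the explicit construction of W, and the unit-variance gaussians are my choices, not part of the answer):

```python
import math
import random

def gen_gaussian_sum(n, total):
    # N-1 IID gaussians, weighted by the orthonormal matrix W
    # (built from the closed form above), then shifted by S/N.
    a = 1.0 / math.sqrt(n)
    k = a / (math.sqrt(n) - 1.0)
    # first row/column = 1/sqrt(N); diagonal = 1-K; elsewhere = -K
    w = [[a if 0 in (i, j) else (1.0 - k if i == j else -k)
          for j in range(n)] for i in range(n)]
    v = [0.0] + [random.gauss(0.0, 1.0) for _ in range(n - 1)]
    return [sum(w[i][j] * v[j] for j in range(n)) + total / n
            for i in range(n)]
```

The columns of W corresponding to the gaussian entries each sum to zero, so the output always totals S exactly (up to floating-point rounding); individual values may be negative, as expected for gaussian outputs.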

The problem with Mark's approach:

Why not just generate the right number of uniformly distributed random numbers, tot them up and scale ?

is that if you do this, you get a "long tail" distribution. Here's an example in MATLAB:

 >> X = rand(100000,10);
 >> Y = X ./ repmat(sum(X,2),1,10);
 >> plot(sort(Y))

I've generated 100,000 sets of N=10 numbers in matrix X, and created matrix Y where each row of Y is the corresponding row of X divided by its sum (so that each row of Y sums to 1.0)

Plotting the sorted values of Y (each column sorted separately) yields approximately the same cumulative distribution:

(plot: sorted columns of Y, each showing the same long-tailed, approximately uniform cumulative distribution)

A true uniform distribution would yield a straight line from 0 to the maximum value. You'll notice that it's sort of vaguely similar to a true uniform distribution, except at the end where there's a long tail. There's an excess of numbers generated between 0.2 and 0.5. The tail gets worse for larger values of N, because although the average value of the numbers goes down (mean = 1 / N), the maximum value stays at 1.0: the vector consisting of 9 values of 0.0 and 1 value of 1.0 is valid and can be generated this way, but is pathologically rare.

If you don't care about this, go ahead and use this method. And there are probably ways to generate "almost"-uniform or "almost"-gaussian distributions with desired sums, that are much simpler and more efficient than the one I describe above. But I caution you to be careful and understand the consequences of the algorithm you choose.


One fixup that leaves things sort-of-uniformly distributed without the long tail, is as follows:

  1. Generate a vector V = N uniformly-distributed random numbers from 0.0 to 1.0.
  2. Find their sum S and their maximum value M.
  3. If S < k*M (maximum value is too much of an outlier), go back to step 1. I'm not sure what value to use for k, maybe k = N/2?
  4. Output the vector V*Sdesired/S
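The fixup steps can be sketched in Python (`gen_rejected_scaled` is my name for it; the default k = N/2 is the guess from step 3, not a derived constant):

```python
import random

def gen_rejected_scaled(n, total, k=None):
    # Regenerate until the maximum value is not too dominant
    # (sum >= k * max), then scale so the values sum to total.
    if k is None:
        k = n / 2.0
    while True:
        xs = [random.uniform(0.0, 1.0) for _ in range(n)]
        if sum(xs) >= k * max(xs):
            break
    s = sum(xs)
    return [x * total / s for x in xs]
```

The acceptance test bounds the largest scaled output by total/k, which is what removes the long tail.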

Example in MATLAB for N=10:

 >> X = rand(100000,10);
 >> Y = X ./ repmat(sum(X,2),1,10);
 >> i = sum(X,2)>(10/2)*max(X,[],2);
 >> plot(sort(Y(i,:)))

(plot: sorted columns of Y restricted to the accepted rows, now without the long tail)

Jason S
+1 ............
High Performance Mark
This was more of what I was looking for :) I'll try it out and keep you updated
dassouki
+2  A: 

All right, we're going to tackle the problem assuming the requirement is to generate a random vector of length N that is uniformly distributed over the allowed space, restated as follows:

Given

  • a desired length N,
  • a desired total sum S,
  • a range of allowed values [0,B] for each scalar value,

generate a random vector V of length N such that the random variable V is uniformly distributed throughout its permitted space.


We can simplify the problem by noting that we can calculate V = U * S where U is a similar random vector with desired total sum 1, and a range of allowed values [0,b] where b = B/S. The value b must be between 1/N and 1.


First consider N = 3. The space of allowed values {U} is a portion of a plane perpendicular to the vector [1 1 1] that passes through the point [1/3 1/3 1/3] and which lies inside the cube whose components range between 0 and b. This set of points {U} is shaped like a hexagon.

(TBD: picture. I can't generate one right now, I need access to MATLAB or another program that can do 3D plots. My installation of Octave can't.)

It is best to use an orthonormal weighting matrix W (see my other answer) with one vector = [1 1 1]/sqrt(3). One such matrix is

octave-3.2.3:1> A=1/sqrt(3)
   A =  0.57735
octave-3.2.3:2> K=1/sqrt(3)/(sqrt(3)-1)
   K =  0.78868
octave-3.2.3:3> W = [A A A; A 1-K -K; A -K 1-K]
   W =

     0.57735   0.57735   0.57735
     0.57735   0.21132  -0.78868
     0.57735  -0.78868   0.21132

which, again, is orthonormal (W*W = I)

If you consider the points of the cube [0 0 b],[0 b b],[0 b 0],[b b 0],[b 0 0], and [b 0 b] these form a hexagon and are all a distance of b*sqrt(2/3) from the cube's diagonal. These do not satisfy the problem in question, but are useful in a minute. The other two points [0 0 0] and [b b b] are on the cube's diagonal.

The orthonormal weighting matrix W allows us to generate points that are uniformly distributed within {U}, because orthonormal matrices are coordinate transformations that rotate/reflect and do not scale or skew.

We will generate points that are uniformly distributed in the coordinate system defined by the 3 vectors of W. The first component is the axis of the diagonal of the cube. The sum of U's components depends completely upon this axis and not at all on the others. Therefore the coordinate along this axis is forced to be 1/sqrt(3) which corresponds to the point [1/3, 1/3, 1/3].

The other two components are in directions perpendicular to the cube's diagonal. Since the maximum distance from the diagonal is b*sqrt(2/3), we will generate uniformly distributed numbers (u,v) between -b*sqrt(2/3) and +b*sqrt(2/3).

This gives us a random variable U' = [1/sqrt(3) u v]. We then compute U = U' * W. Some of the resulting points will be outside the allowable range (each component of U must be between 0 and b), in which case we reject that and start over.

In other words:

  1. Generate independent random variables u and v that are each uniformly distributed between -b*sqrt(2/3) and +b*sqrt(2/3).
  2. Calculate the vector U' = [1/sqrt(3) u v]
  3. Compute U = U' * W.
  4. If any of U's components are outside the range [0,b], reject this value and go back to step 1.
  5. Calculate V = U * S.

The solution is similar for higher dimensions (uniformly distributed points within a portion of the hyperplane perpendicular to a hypercube's main diagonal):

Precalculate a weighting matrix W of rank N.

  1. Generate independent random variables u1, u2, ... uN-1 each uniformly distributed between -b*k(N) and +b*k(N).
  2. Calculate the vector U' = [1/sqrt(N) u1, u2, ... uN-1]
  3. Compute U = U' * W. (there are shortcuts to actually having to construct and multiply by W.)
  4. If any of U's components are outside the range [0,b], reject this value and go back to step 1.
  5. Calculate V = U * S.

The range k(N) is a function of N that represents the maximum distance of the vertices of a hypercube of side 1 from its main diagonal. A vertex with m coordinates equal to 1 lies at distance sqrt(m - m^2/N) from the diagonal, which is maximized at m = floor(N/2); this gives sqrt(2/3) for N = 3 and sqrt(6/5) for N = 5.
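A plain-Python sketch of the general procedure (the helper name and the explicit construction of W are mine; I take k(N) to be the maximum vertex distance sqrt(m - m*m/N) with m = N//2, which matches the sqrt(2/3) and sqrt(6/5) values quoted for N = 3 and N = 5):

```python
import math
import random

def gen_uniform_sum(n, total, bound):
    # Uniform sampling on the slice of the hypercube where the
    # components sum to `total`, each component in [0, bound].
    b = bound / total                      # normalized range [0, b]
    a = 1.0 / math.sqrt(n)
    kw = a / (math.sqrt(n) - 1.0)
    # orthonormal W from the closed form in the other answer
    w = [[a if 0 in (i, j) else (1.0 - kw if i == j else -kw)
          for j in range(n)] for i in range(n)]
    m = n // 2
    kn = math.sqrt(m - m * m / n)          # assumed k(N), see lead-in
    while True:
        # first coordinate fixed at 1/sqrt(N) forces sum(U) == 1
        uprime = [a] + [random.uniform(-b * kn, b * kn)
                        for _ in range(n - 1)]
        u = [sum(uprime[j] * w[j][i] for j in range(n))
             for i in range(n)]
        if all(0.0 <= ui <= b for ui in u):   # reject points outside [0,b]^N
            return [ui * total for ui in u]
```

For large N the rejection rate grows quickly, so this sketch is only practical for small N.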

Jason S
+1 for clearly stating the question.
EOL