I need to generate n percentages (integers between 0 and 100) such that the sum of all n numbers adds up to 100.

If I just do nextInt() n times, each time ensuring that the parameter is 100 minus the previously accumulated sum, then my percentages are biased (i.e. the first generated number will usually be largest etc.). How do I do this in an unbiased way?

A: 

Once you pick the numbers with the method you describe, shuffle the order of the numbers. This way the final list of numbers has a more even distribution.

However, note that no matter what you do, you can't get a perfectly even distribution since once you start to choose numbers, your random trials are not independent. See ataylor's answer.

Note also that the algorithm you describe might not give you the required output. The last number cannot be random since it must make the sum equal 100.
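In code, the sequential method from the question plus a final shuffle might look like this (a sketch; the class and method names are mine):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ShuffledPercentages {
    // Generate percentages sequentially (biased by position), then shuffle
    // so that no position is systematically larger than another.
    static List<Integer> biasedThenShuffled(int n, Random rnd) {
        List<Integer> parts = new ArrayList<>();
        int remaining = 100;
        for (int i = 0; i < n - 1; i++) {
            int p = rnd.nextInt(remaining + 1); // 0..remaining inclusive
            parts.add(p);
            remaining -= p;
        }
        parts.add(remaining);            // last value is forced to make the sum 100
        Collections.shuffle(parts, rnd); // positions no longer favor early draws
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(biasedThenShuffled(5, new Random()));
    }
}
```

Note the shuffle only removes the positional bias; as discussed above, the joint distribution of values is still not uniform.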

Ben S
"not independent"! Of course! I had a feeling like this but couldn't put it into words.
Joachim Sauer
A: 

First, obvious solution.

Random rnd = new Random();
int[] a = new int[n];
do {
    for (int i = 0; i < n; ++i) {
        a[i] = rnd.nextInt(101); // random integer between 0 and 100 inclusive
    }
} while (java.util.Arrays.stream(a).sum() != 100);

It's not perfect in terms of complexity (the number of iterations needed to reach a sum of 100 can be quite large), but the distribution is surely 'unbiased'.

edit
Similar problem: how to generate a random point in a circle with radius 1 centered at (0, 0)? Solution: keep generating random points in the square [-1..1, -1..1] until one of them falls inside the circle :)
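That rejection-sampling idea can be sketched in Java (class and method names are mine):

```java
import java.util.Random;

public class CirclePoint {
    // Draw uniformly from the square [-1,1] x [-1,1] and retry until the
    // point lands inside the unit circle; the accepted point is uniform
    // over the circle.
    static double[] randomPointInCircle(Random rnd) {
        double x, y;
        do {
            x = 2 * rnd.nextDouble() - 1;
            y = 2 * rnd.nextDouble() - 1;
        } while (x * x + y * y > 1);
        return new double[] {x, y};
    }

    public static void main(String[] args) {
        double[] p = randomPointInCircle(new Random());
        System.out.printf("(%f, %f)%n", p[0], p[1]);
    }
}
```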

Nikita Rybak
What if the first random number is 99 and the second one is 2?
Joachim Sauer
What if `n=3` and the first two numbers sum to > 100?
Ben S
@Joachim, @Ben Then the final sum will be bigger than 100 and we'll have to repeat the process.
Nikita Rybak
This won't work - The OP wants to generate a specific number of percentages but your code could potentially generate just 1 (if the first random number generated is 100).
Adamski
@Adamski Please check the code. It will _always_ generate n integers, even if the first one is 100.
Nikita Rybak
@Nikita, That may be the case but it won't generate N integers that sum to 100 as requested.
Adamski
@Adamski You don't believe in the power of random? :) Repeating 'do .. until' 1000 times should be enough for most values of n.
Nikita Rybak
By "most values of n" do you mean any integer between 1 and infinity?
Adamski
lols this is the equivalent of the bogosort. The algorithm is correct but will require an enormous amount of iterations to complete. I'm working out the number now! I hope you meant it as a joke :)
Il-Bhima
@Adamski Ok, smart pants, I mentioned the complexity issue in my post :)
Nikita Rybak
@Il-Bhima yeah, except that success probability is pretty decent for not too big n (unlike 1/n! with bogosort) :)
Nikita Rybak
For two numbers the probability is about 1/100, so you'd need around 100 iterations for just two, and for n = 3 it's much larger. I would say it's at least exponential.
Il-Bhima
@Il-Bhima You're right, complexity will be big, even though it's not big at all for 3 :) With n == 3, the maximum possible sum is 300. The most probable is 150 (p(150) >= 1/300). We want 100, which is not so far away from 150 and not much less probable. Maybe, when I'm cured of laziness, I'll even test it to see how it works for n = 5, 10 or 15!
Nikita Rybak
+4  A: 

Generate n random integers with any range (call them a[1]..a[n]). Sum up your integers and call that b. Your percentages will be [a[1]/b, ..., a[n]/b].

Edit: good points, rounding the results to total exactly 100 is non-trivial. One approach would be to take the floor of 100*a[x]/b for x in 1..n as your integers, then distribute the remaining units 100-(sum of integers) randomly. I'm not sure whether this would introduce any bias into the result.
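One possible sketch of this floor-and-distribute idea (the class name and the 1..1000 range are my choices, and as the edit above says, it is unclear whether the remainder step introduces bias):

```java
import java.util.Random;

public class NormalizedPercentages {
    // Generate n values in "any range", normalize to percentages by flooring,
    // then hand the leftover units to random entries so the sum is exactly 100.
    static int[] percentages(int n, Random rnd) {
        int[] a = new int[n];
        int b = 0;
        for (int i = 0; i < n; i++) {
            a[i] = 1 + rnd.nextInt(1000);   // 1..1000; avoids a zero total
            b += a[i];
        }
        int[] p = new int[n];
        int sum = 0;
        for (int i = 0; i < n; i++) {
            p[i] = 100 * a[i] / b;          // integer division = floor of 100*a[i]/b
            sum += p[i];
        }
        for (int r = 100 - sum; r > 0; r--) {
            p[rnd.nextInt(n)]++;            // distribute the remaining units randomly
        }
        return p;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(percentages(5, new Random())));
    }
}
```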

ataylor
The only problem is that a[0]/b may not be an integer.
Nikita Rybak
@Nikita: So what? Just round so that after rounding the sum = 100.
Ben S
this seems like a good solution. Say the ratio turns out not to be an integer; could I then (at random) pick one of the numbers and add to it the difference between the total sum and 100? This seems to minimize the distribution bias.
erw
@Ben That tinkering will make distribution not so even (some combinations will happen more often than others). But I agree that it's pretty close if "any range" is big enough.
Nikita Rybak
-1: Sorry, this is useless and making it right will make it unnecessarily complicated. I believe mikera's solution is much better.
Moron
After doing this, you'll have a bunch of floats with fractional percentages that sum to 1. Do one more random() and assign that last % point based on the chance that each one gets it. So if you have (33.3,33.3 and 33.4), then the first two would have a 30% chance and the last one a 40% chance of getting that last % point.
LanceH
+1  A: 

To be precise, it depends on exactly how you want the samples to be unbiased. Here is a rough method that will give you an approximately uniform result.

  1. Generate n-1 integers from 0..100, say a[i] for i = 0 to n-2.
  2. Let total be the sum of these numbers.
  3. Compute b[i] = floor(100*a[i]/total) for i = 0 to n-2.
  4. Set b[n-1] = 100 - (b[0] + ... + b[n-2]).

Then b is your resulting array of percentages.

The last one will be biased, but the rest should be uniform.
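The four steps can be transcribed almost directly into Java (a sketch; the class name and the guard against an all-zero draw are my additions):

```java
import java.util.Random;

public class RoughPercentages {
    // Steps 1-4 above: generate n-1 raw values, normalize by flooring,
    // and let the last entry absorb the rounding error.
    static int[] percentages(int n, Random rnd) {
        int[] a = new int[n - 1];
        int total = 0;
        for (int i = 0; i < n - 1; i++) {   // step 1
            a[i] = rnd.nextInt(101);
            total += a[i];
        }
        if (total == 0) total = 1;          // step 2, guarding against division by zero
        int[] b = new int[n];
        int sum = 0;
        for (int i = 0; i < n - 1; i++) {   // step 3
            b[i] = 100 * a[i] / total;      // integer division is floor here
            sum += b[i];
        }
        b[n - 1] = 100 - sum;               // step 4
        return b;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(percentages(5, new Random())));
    }
}
```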

Of course, if you want to do this more accurately you'll have to use Gibbs sampling or Metropolis-Hastings.

Il-Bhima
I think this is the way to go. If you could remove the integer restriction, you could get rid of the rounding, which may lead to some biases, but the whole process of generating the numbers and then normalizing seems sound.
Grembo
+6  A: 

You possibly need to define what you really mean by "biased" - but if all you care about is that the distribution of the numbers is independent of their position, then you can simply create the numbers in a "biased" way and then randomise their positions.

Another "unbiased" method would be to create n-1 random percentages, sort them (call them x1, x2, x3, ...) and then define your final percentages to be:

x1
x2 - x1
x3 - x2
...
100 - x(n-1)

That way you will get n random numbers that add to 100.
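A sketch of this in Java (class and method names are mine):

```java
import java.util.Arrays;
import java.util.Random;

public class SortedDifferences {
    // Sort n-1 random percentages and output the consecutive differences,
    // so the n results are non-negative and sum to exactly 100.
    static int[] percentages(int n, Random rnd) {
        int[] x = new int[n - 1];
        for (int i = 0; i < n - 1; i++) {
            x[i] = rnd.nextInt(101);         // 0..100 inclusive
        }
        Arrays.sort(x);
        int[] result = new int[n];
        int prev = 0;
        for (int i = 0; i < n - 1; i++) {
            result[i] = x[i] - prev;         // x1, x2 - x1, x3 - x2, ...
            prev = x[i];
        }
        result[n - 1] = 100 - prev;          // 100 - x(n-1)
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(percentages(6, new Random())));
    }
}
```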

mikera
@erw: This solution is much better than the one currently at the top (in number of votes)! It even uses one fewer random number call and involves no rounding etc. Also, I think we can _prove_ that it generates each combination with the same probability.
Moron
+1 Same as Adamski's solution, and it has the same problem with zeroes. But perfect if all elements are > 0.
Nikita Rybak
@Nikita: If you view that as a generating a multi-set of n-1 numbers, then it gives an unbiased distribution (I think, but haven't tried proving it), as each distribution we need corresponds exactly to one multi-set. My statement about the one random number call less is not really accurate as generating random numbers one by one to create the multiset is not unbiased, as you say. In fact, I believe it is possible to do this with just _one_ random number call if n is small enough. See: http://stackoverflow.com/questions/3003192/five-unique-random-numbers-from-a-subset/3003253#3003253.
Moron
To add to the previous comment: since 100 choose 50 < 2^128, we can do this with just two 64-bit random number calls! (I am assuming n <= 100.)
Moron
@Moron why would we want to limit ourselves to one random call? And the problem I mentioned (you were arguing with that, right?) is in the fact that he's not generating multisets correctly: there'll be two 'paths' to generate the (1, 2) multiset (from the sequences [1, 2] and [2, 1]) and only one way for (1, 1). So the first one has twice as big a probability.
Nikita Rybak
@Nikita: I agree completely. He is not generating multisets correctly. What I am saying is that generating the multiset (not a sequence of n-1 numbers, but a _multiset of n-1 numbers_), which we need for a uniform distribution, can be done using _one_ random number call. Maybe I should add that as an answer!
Moron
+3  A: 

The key is to generate N random numbers between 0 and 100 but to use these as "markers" rather than the final sequence of numbers to output. Then you iterate through your list of markers in ascending order, calculating each percentage to output as (current marker - previous marker).

This will give a much more even distribution than simply generating and outputting each number one at a time.

Example

import java.util.Random;
import java.util.TreeSet;
import java.util.SortedSet;

public class Main {
  public static void main(String[] args) {
    Random rnd = new Random();
    SortedSet<Integer> set = new TreeSet<Integer>();

    for (int i=0; i<9; ++i) {
      set.add(rnd.nextInt(101));
    }

    if (set.last() < 100) {
      set.add(100);
    }    

    int prev = 0;
    int total = 0;    
    int output;

    for (int j : set) {
      output = j - prev;
      total += output;
      System.err.println(String.format("Value: %d, Output: %d, Total So Far: %d", j, output, total));
      prev = j;
    }
  }
}

Output

$ java Main
Value: 0, Output: 0, Total So Far: 0
Value: 2, Output: 2, Total So Far: 2
Value: 55, Output: 53, Total So Far: 55
Value: 56, Output: 1, Total So Far: 56
Value: 57, Output: 1, Total So Far: 57
Value: 69, Output: 12, Total So Far: 69
Value: 71, Output: 2, Total So Far: 71
Value: 80, Output: 9, Total So Far: 80
Value: 92, Output: 12, Total So Far: 92
Value: 100, Output: 8, Total So Far: 100
Adamski
I suspect the distributions will be the same as in ataylor's or my solution. Yours, however, involves inserting into a BST, so O(N log N) rather than O(N). Also you need to add a last entry to get the sum to 100.
Il-Bhima
+1, I like the idea. Each valid distribution translates into a set of (n - 1) markers, so we could generate the markers as well. The code looks quite difficult, though; I'm sure it can be made simpler :)
Nikita Rybak
It does have a problem with zeroes, though. Set of markers (1, 2, .., n - 1) will be generated in n! cases (considering order), while set of markers (100, 100, .., 100) (leading to distribution 100, 0, .., 0) will be generated only in one case.
Nikita Rybak
+2  A: 

Make an array. Randomly drop 100 percentage points into the buckets of that array. The example shows n=7.

import java.util.Random;

public class Random100 {
    public static void main(String[] args) {
        Random rnd = new Random();
        int[] percents = new int[7];
        for (int i = 0; i < 100; i++) {
            int bucket = rnd.nextInt(7);  // pick a random bucket
            percents[bucket]++;           // drop one percentage point into it
        }
        for (int i = 0; i < 7; i++) {
            System.out.println("bucket " + i + ": " + percents[i]);
        }
    }
}
LanceH
Nope :) Your program for n == 2 has only one way of reaching the distribution (100, 0): rnd.nextInt == 0 on each step. But there are lots of ways to reach (50, 50): the binomial coefficient 100 choose 50. And both distributions should have equal probability.
Nikita Rybak
rnd.nextInt(2) yields 0 or 1, seems to work just fine. I'm getting results of (46, 54), (48, 52), (56,44), etc...
LanceH
-1: Agree with Nikita.
Moron
@LanceH notice how your examples are clustered close to (50,50). Those closer to (50,50) have a much higher probability of occuring than e.g. (100, 0).
ataylor
Misread what Nikita wrote. Yes, it tends toward the middle, which looks particularly bad for n=2. But with n=2 the solution is trivial: a simple random draw for the first number, and the second number is determined. For n=3, no expectation is mentioned. If the distribution is equal between all 3, then the distribution for a single bucket is not going to be uniform from 0 to 100... which *could* imply the same thing for n=2. At the least it implies that n=2 is an entirely different beast than n>2. There is no mention of expected distribution.
LanceH
Exactly true that this answer is perfectly valid since no expected distribution was mentioned.
Kevin Bourrillion
A: 

I had a similar issue and ended up doing as you said, generating random integers up to the difference of the sum of the existing integers and the limit. I then randomized the order of the integers. It worked pretty well. That was for a genetic algorithm.

Caleb Thompson
+11  A: 

A couple of answers suggest picking random percents and taking the differences between them. As Nikita Rybak points out, this will not give the uniform distribution over all possibilities; in particular, zeroes will be less frequent than expected.

To fix this, think of starting with 100 'percents' and inserting dividers. I will show an example with 10:

 % % % % % % % % % % 

There are eleven places we could insert a divider: between any two percents or at the beginning or end. So insert one:

 % % % % / % % % % % % 

This represents choosing four and six. Now insert another divider. This time, there are twelve places, because the divider already inserted creates an extra one. In particular, there are two ways to get

 % % % % / / % % % % % % 

either inserting before or after the previous divider. You can continue the process until you have as many dividers as you need (one fewer than the number of percentages).

 % % / % / % / / % % % / % % % / 

This corresponds to 2,1,1,0,3,3,0.

We can prove that this gives the uniform distribution. The number of compositions of 100 into k parts is the binomial coefficient (100+k-1) choose (k-1), that is, (100+k-1)*(100+k-2)*...*101 / (k-1)!. The probability of choosing any particular composition should therefore be the reciprocal of this.

As we insert dividers one at a time, first we choose from 101 positions, then 102, then 103, and so on, until we get to 100+k-1. So the probability of any particular sequence of insertions is 1 / ((100+k-1)*...*101).

How many insertion sequences give rise to the same composition? The final composition contains k-1 dividers, and they could have been inserted in any order, so there are (k-1)! sequences that give rise to a given composition. Hence the probability of any particular composition is (k-1)! / ((100+k-1)*...*101), which is exactly what it should be.

In actual code, you probably wouldn't represent your steps like this. You should be able to just hold on to numbers, rather than sequences of percents and dividers. I haven't thought about the complexity of this algorithm.
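One way to code it (all names are mine): since the proof shows every final divider configuration is equally likely, it suffices to draw n-1 distinct divider positions uniformly from the 100+n-1 slots and read off the gap sizes.

```java
import java.util.Random;
import java.util.TreeSet;

public class DividerComposition {
    // n non-negative integers summing to 100, uniform over all compositions:
    // choose n-1 distinct divider positions among 100+n-1 slots, then each
    // part is the number of percents between consecutive dividers.
    static int[] percentages(int n, Random rnd) {
        int slots = 100 + n - 1;              // 100 percents plus n-1 dividers
        TreeSet<Integer> dividers = new TreeSet<>();
        while (dividers.size() < n - 1) {
            dividers.add(rnd.nextInt(slots)); // duplicates are simply retried
        }
        int[] parts = new int[n];
        int prev = -1, i = 0;
        for (int d : dividers) {
            parts[i++] = d - prev - 1;        // percents strictly between dividers
            prev = d;
        }
        parts[i] = slots - 1 - prev;          // percents after the last divider
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(percentages(7, new Random())));
    }
}
```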

eruonna
+1 This should be equivalent to randomly picking a number that has already been drawn (with multiplicity) or a new one.
starblue
It is not because there is the possibility of choosing the space before or after a divider. Essentially, there are two ways to pick a number already drawn. (Or n+1 if that number has been drawn n times.)
eruonna
Oh, if you mean either picking from the list of already drawn numbers or any number from 0-100 (not necessarily new), then they are the same. You just have to make sure to give the same probability to any choice.
eruonna
Yes, that's the way to do it.
starblue
There's something fishy about this to me, although I don't understand it deeply. For a given result, there were multiple possible sequences of divider insertions that would lead to it. When the result contains no zeroes, there were only n! ways to get that result (corresponding to the different orders in which the dividers could have been placed). But when there are zeroes, there's more duplication than that... or is there? Combinatorics is hard.
Kevin Bourrillion
*"I haven't thought about the complexity of this algorithm."* A simple O(n) implementation would be to generate a number from 1-100, then another from 1-101, another from 1-102, ..., last from 1-(100+n-1). Note that when a new number is generated, all previous numbers larger than it need to be incremented by 1 (which signifies all larger elements being "bumped" to the right in your diagram above). The final set of numbers is the positions of all the dividers.
BlueRaja - Danny Pflughoeft
Come to think of it, that's `O(n^2)` - whoops!
BlueRaja - Danny Pflughoeft
@Kevin: There are always exactly (k-1)! ways to get a particular result. To see this, consider that once you know the result, you know the final configuration of dividers, you just don't know what order they were put down in. Any order is possible; you can derive the sequence of insertions by adding dividers one at a time in order.
eruonna
As far as I know you're right. Did I mention combinatorics is hard? And I even used to be good at it once upon a time. :)
Kevin Bourrillion
+5  A: 

This problem is known as uniform sampling from a simplex and Wikipedia gives two algorithms:

http://en.wikipedia.org/wiki/Simplex#Random_sampling

See also these related questions:

dreeves
+1 for giving a name
Joachim Sauer
Unfortunately the wikipedia article is a little inaccessible to most of this audience. I need it dumbed down. :)
Kevin Bourrillion
@Kevin: The first algorithm is the same as [ataylor's](http://stackoverflow.com/questions/3007975/java-random-percentages/3008067#3008067), the second is the same as [mikera's](http://stackoverflow.com/questions/3007975/java-random-percentages/3008078#3008078) (which unfortunately suffers from the problem Nikita brought up - it's mentioned in the article there's a known problem, but they don't specify what). [eruonna's solution](http://stackoverflow.com/questions/3007975/java-random-percentages/3009154#3009154) solves that problem - interesting that Wikipedia doesn't know about it yet :)
BlueRaja - Danny Pflughoeft
@BlueRaja: Not true about ataylor's solution. You have to generate from an exponential distribution and then normalize.
dreeves
@BlueRaja: eruonna's solution has the same (great!) idea as the proof that the total number of sets is 100+N-1 choose N. It uses too many random number calls though. Imagine generating such numbers a million times. Similar ideas can give us the set with just _one_ (or two) random number calls for n<=100. I have described that in my answer.
Moron
A: 

Imagine you have 100 stones and N buckets to place them in. You can take the stones one at a time and place each one in a random bucket. This way the total will be the 100 you started with, and there will be no bias between any buckets.

import java.util.Arrays;
import java.util.Random;

public class RandomBuckets {
    public static int[] randomBuckets(int total, int n_buckets) {
        int[] buckets = new int[n_buckets];
        Random rand = new Random();
        for (int i = 0; i < total; i++)
            buckets[rand.nextInt(n_buckets)]++;
        return buckets;
    }

    public static void main(String... args) {
        for (int i = 2; i <= 10; i++)
            System.out.println(Arrays.toString(randomBuckets(100, i)));
    }
}

Prints

[55, 45]
[38, 34, 28]
[22, 21, 32, 25]
[28, 24, 18, 15, 15]
[17, 14, 13, 21, 18, 17]
[17, 19, 14, 15, 6, 15, 14]
[11, 14, 14, 14, 4, 17, 9, 17]
[13, 12, 15, 12, 8, 10, 9, 11, 10]
[11, 13, 12, 6, 6, 11, 13, 3, 15, 10]

As the count increases, the distribution approaches uniform.

System.out.println(Arrays.toString(randomBuckets(100000000, 100)));

Prints

[1000076, 1000612, 999600, 999480, 998226, 998303, 1000528, 1000450, 999529, 
998480, 998903, 1002685, 999230, 1000631, 1001171, 997757, 1000349, 1000527, 
1002408, 1000852, 1000450, 999318, 999453, 1000099, 1000759, 1000426, 999404, 
1000758, 1000939, 999950, 1000493, 1001396, 1001007, 999258, 1001709, 1000593,
1000614, 1000667, 1000168, 999448, 999350, 1000479, 999991, 999778, 1000513, 
998812, 1001295, 999314, 1000738, 1000211, 999855, 999349, 999842, 999635, 
999301, 1001707, 998224, 1000577, 999405, 998760, 1000036, 1000110, 1002471, 
1000234, 1000975, 998688, 999434, 999660, 1001741, 999834, 998855, 1001009, 
999523, 1000207, 998885, 999598, 998375, 1000319, 1000660, 1001727, 1000546, 
1000438, 999815, 998121, 1001128, 1000191, 998609, 998535, 999617, 1001895, 
999230, 998968, 999844, 999392, 999669, 999407, 998380, 1000732, 998778, 1000522]
Peter Lawrey
By the central limit theorem this will tend to a normal distribution about the centre (total/2) as n_buckets->infinity. I think what the OP wants is to have each possible combination equally likely (i.e a uniform distribution on the set of all n integers adding up to 100).
Il-Bhima
as I commented at top, I don't believe uniform is possible.
Kevin Bourrillion
@Il-Bhima Every bucket has an equal chance of being incremented, so why would you suspect anything other than an equally flat distribution?
Peter Lawrey
@Kevin The result is not completely uniform, but the difference between the lowest and highest in the last example is much less than 1%.
Peter Lawrey
A: 

Assuming 1 <= n <= 100,

if you have to generate such a random set of numbers a couple of thousand times, I expect this method will be very useful, as we can create one set of n random numbers which sum to 100 with no more than two random number calls!

For the sake of simplicity, we will discuss the case n=10; it can be generalized easily.

The basic idea is this:

Say we are able to generate a multiset of numbers {x1 <= x2 <= ... <= x9}, each multiset being equally probable.

Then the set of numbers z1 = x1, z2 = x2-x1, ..., z9 = x9-x8, z10 = 100 - x9 are such that they sum to 100. Since x1, ..,x9 uniquely determine z1, ..., z10 and vice-versa, it is enough to concentrate on generating a multiset of 9 numbers.

To generate a multiset consider this:

Let y1 = x1, y2 = x2+1, .., y9 = x9 + 8.

Since x1 <= x2 ... <= x9, we have that y1 < y2 < ... < y9.

Also, we have that 0 <= yi <= 108.

Notice that the xi's uniquely determine the yis (and vice-versa).

Thus if we can generate the set of yi's uniformly, we are done.

Notice that the set of yi's is just a subset (a set, not a multiset) of 9 numbers from {0, 1, 2, ..., 108}.

To generate a random subset, now all you need to do is

This gives you a random set of yis, get back the xis, and then the zis to get a set of 10 numbers summing to 100.

If 1 <= n <= 100, we need no more than two (or one, depending on n) random number calls, as the relevant counts can get bigger than 64-bit integers but fit within 128 bits (if my calculations are correct).
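As a concrete sketch for n = 10, where 109 choose 9 (about 4.3e12) fits comfortably in a long, the whole construction can run off a single random draw by unranking a 9-element subset of {0, ..., 108}. The lexicographic unranking routine and all names below are my additions; larger n would need 128-bit arithmetic as noted above.

```java
import java.util.concurrent.ThreadLocalRandom;

public class OneCallComposition {
    // Binomial coefficient; assumed to fit in a long at these sizes.
    static long binom(int n, int k) {
        if (k < 0 || k > n) return 0;
        long r = 1;
        for (int i = 1; i <= k; i++) r = r * (n - k + i) / i;
        return r;
    }

    // n non-negative integers summing to total, uniform over all such
    // combinations, from one random draw: unrank the subset {y1 < ... < y(n-1)}
    // of {0, ..., total+n-2} and convert straight to the z's described above.
    static int[] composition(int total, int n) {
        int k = n - 1, m = total + n - 1;     // choose a k-subset of {0, ..., m-1}
        long rank = ThreadLocalRandom.current().nextLong(binom(m, k));
        int[] z = new int[n];
        int x = 0, prev = -1;
        for (int i = 0; i < k; i++) {
            // lexicographic unranking: skip all subsets whose i-th element is x
            while (binom(m - 1 - x, k - 1 - i) <= rank) {
                rank -= binom(m - 1 - x, k - 1 - i);
                x++;
            }
            z[i] = x - prev - 1;              // z_i = y_i - y_(i-1) - 1
            prev = x++;
        }
        z[k] = m - 1 - prev;                  // z_n = 100 - x_(n-1)
        return z;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(composition(100, 10)));
    }
}
```

ThreadLocalRandom's nextLong(bound) stands in for "one random number call" here; any uniform draw from [0, C(109, 9)) works.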

Moron