Generate exponential distribution of bucket sizes

Given a series of incoming items, I want to assign each one to a bucket as it comes in. The bucket can either be a new one (one that has never been used before, of which there are an infinite supply) or it can be an existing bucket. If I look at the number of buckets with one item, the number with two, the number with three, etc., I want those bucket counts to follow an exponential distribution. I hope I'm saying that correctly -- if 80% of the buckets have 1 item, then 16% should have two, 3.2% should have three, etc. In general, the number of buckets of size k should be 1/p as many as the number of buckets of size k-1, and the fraction of buckets that are size 1 should be (1-p) to make the math work out right.

If I knew the number of items in advance, it would be easy: I'd know how many buckets of each size, so I can just pile the the items into as many buckets of each size as I needed. Or if I were just generating bucket sizes, it'd be easy even if I didn't know the total in advance: each new bucket has probability (1-p) of being size 1, (1-p)p of being size 2, (1-p)p^2 size 3, (1-p)p^3 size 4, etc.

But I am processing items, so as I get an item in, I need to choose a bucket: either an existing one, or a new one. If I make a new bucket, then I get one more of size 1. But if I choose an existing one of size k, then I get one more bucket of size k+1, and one fewer of size k. So what should the probability of choosing a bucket of size k be (where k can be zero to indicate making a new bucket)? And how does it relate to p?

I am doing this to generate test data where I only know the mean number of items per bucket (1.5, it turns out), don't know the number of items in advance, and think that the exponential distribution of bucket sizes is realistic for my situation. I'm probably making it more complicated than it needs to be, but I was surprised because I thought it would be easy.

And maybe it is, but I'm just missing something. (And I can't figure out how to Google it, either.)

You may be right. I would need to randomly select an existing bucket with the technique given in ShreevatsaR's answer, not with a uniform sampling, and it's unclear to me what the final frequency of size-1 buckets in particular would be. But it may work out to be correct.

sfink 2009-04-04 23:41:45

Doh! I'm supposed to know the difference between the exponential and geometric distributions; thanks.

sfink 2009-04-04 21:59:51

Although I'm getting items one at a time (it's a server returning a bucket per query), your answer is easily adaptable (keep a "current bucket"). But I don't like that sequential items are in the same bucket; it would skew the test results. Even keeping a current bucket for each size skews things.

sfink 2009-04-04 22:05:44

Couldn't you just do the above, and randomly permute the items at the end? You could also do the random "permutation" incrementally, as follows: after placing each item, swap it with an item randomly chosen among all items seen so far.

ShreevatsaR 2009-04-04 22:24:57

The items have an order that they were sent to the server. Permuting the items doesn't affect the relationship between that ordering and the bucket selections. The server doesn't return items, just bucket ids for each query item in sequence. And I can't change bucket assignments of past items.

sfink 2009-04-04 23:38:57

With those constraints, it seems a hard problem indeed. Forget the geometric distribution; what would you do even for the "uniform distribution" (not technically a distribution: where every bucket has exactly two items, say). How can you avoid sequential items being in the same (or nearby) buckets?

ShreevatsaR 2009-04-05 13:56:15

ansaurus

tags:

views:

answers:

Generate exponential distribution of bucket sizes

related questions