views:

514

answers:

3

Given a series of incoming items, I want to assign each one to a bucket as it comes in. The bucket can either be a new one (one that has never been used before, of which there are an infinite supply) or it can be an existing bucket. If I look at the number of buckets with one item, the number with two, the number with three, etc., I want those bucket counts to follow an exponential distribution. I hope I'm saying that correctly -- if 80% of the buckets have 1 item, then 16% should have two, 3.2% should have three, etc. In general, the number of buckets of size k should be 1/p as many as the number of buckets of size k-1, and the fraction of buckets that are size 1 should be (1-p) to make the math work out right.

If I knew the number of items in advance, it would be easy: I'd know how many buckets of each size, so I can just pile the the items into as many buckets of each size as I needed. Or if I were just generating bucket sizes, it'd be easy even if I didn't know the total in advance: each new bucket has probability (1-p) of being size 1, (1-p)p of being size 2, (1-p)p^2 size 3, (1-p)p^3 size 4, etc.

But I am processing items, so as I get an item in, I need to choose a bucket: either an existing one, or a new one. If I make a new bucket, then I get one more of size 1. But if I choose an existing one of size k, then I get one more bucket of size k+1, and one fewer of size k. So what should the probability of choosing a bucket of size k be (where k can be zero to indicate making a new bucket)? And how does it relate to p?

I am doing this to generate test data where I only know the mean number of items per bucket (1.5, it turns out), don't know the number of items in advance, and think that the exponential distribution of bucket sizes is realistic for my situation. I'm probably making it more complicated than it needs to be, but I was surprised because I thought it would be easy.

And maybe it is, but I'm just missing something. (And I can't figure out how to Google it, either.)

+1  A: 

Here's my two cents: As you read items in, keep a counter of both buckets and items, and use this to calculate the number of items per bucket. If the items/buckets threshold goes over 1.5, you put the next item in a new bucket. If not, you assign the next item into a randomly selected existing bucket.

Most random number generators strive to generate statistically probable sets of random numbers (e.g., they avoid generating patterns), so my mathematical gut instinct says that this should work well enough to do what you need.

Don Werve
You may be right. I would need to randomly select an existing bucket with the technique given in ShreevatsaR's answer, not with a uniform sampling, and it's unclear to me what the final frequency of size-1 buckets in particular would be. But it may work out to be correct.
sfink
+4  A: 
ShreevatsaR
Doh! I'm supposed to know the difference between the exponential and geometric distributions; thanks.
sfink
Although I'm getting items one at a time (it's a server returning a bucket per query), your answer is easily adaptable (keep a "current bucket"). But I don't like that sequential items are in the same bucket; it would skew the test results. Even keeping a current bucket for each size skews things.
sfink
Couldn't you just do the above, and randomly permute the items at the end? You could also do the random "permutation" incrementally, as follows: after placing each item, swap it with an item randomly chosen among all items seen so far.
ShreevatsaR
The items have an order that they were sent to the server. Permuting the items doesn't affect the relationship between that ordering and the bucket selections. The server doesn't return items, just bucket ids for each query item in sequence. And I can't change bucket assignments of past items.
sfink
With those constraints, it seems a hard problem indeed. Forget the geometric distribution; what would you do even for the "uniform distribution" (not technically a distribution: where every bucket has exactly two items, say). How can you avoid sequential items being in the same (or nearby) buckets?
ShreevatsaR
A: 

Here's how I would do it.

For each item that comes in, draw a number from a uniform distribution over (0,1). This is your CDF for the geometric distribution which is, as the commenter above noted, what you are looking for. Then the bucket that you want is ln((1-CDF)/(1-p)). Suppose you only have placed items in buckets 1 and 2 so far and then you get an item destined for bucket 5, no problem, just use a hash table to keep track of the buckets that you have filled so far.

af