What is the appropriate data structure for this task?

I have a set of N items. N is large. Each item has a positive weight value associated with it.

I would like to do the following, quickly:

inner loop:

Sample an item, according to its weight.

[process...]

Update the weight of K items, where K << N.

When I say sample by weight, this is different from uniform sampling: an item is selected with probability proportional to its weight. So if there are two items, one with weight 0.8 and one with weight 0.2, they are selected with probability 80% and 20% respectively.

The number of items N remains fixed. Weights are in a bounded range, say [0, 1]. Weights do not always sum to one.

A naive approach takes O(N) time to sample. Is there an O(log N) algorithm?

What is the appropriate data structure for this task? I believe that red-black trees are inappropriate, since they treat each item as having equal weight.

+2  A: 

Actually, you can use (modified) RB-trees for this. Moreover, a modification of any balanced tree (not necessarily binary) will do.

The trick is to store additional information in each node - in your case, it could be the total weight of the subtree rooted at the node, or something like that.

When you update (i.e. insert/delete) the tree, you follow the algorithm for your favorite tree. As you change the structure, you just recalculate the sums of the affected nodes (an O(1) operation per rotation, B-tree split or join). When you change the weight of an item, you update the sums along the path from that node up to the root, which is O(log N) in a balanced tree.
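A minimal sketch of that weight update, assuming a hypothetical Node with parent pointers and a cached subtree sum (the layout and names below are mine):

    struct Node {
        double weight;       // the item's own weight
        double subtree_sum;  // weight + both children's subtree sums
        Node *left, *right, *parent;
    };

    // After changing one item's weight, refresh the cached sums on the
    // path from the node back to the root: O(log N) in a balanced tree.
    void update_weight(Node* n, double new_weight) {
        double delta = new_weight - n->weight;
        n->weight = new_weight;
        for (; n != nullptr; n = n->parent)
            n->subtree_sum += delta;
    }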

When you sample, you run a modified version of search. You take the sum of all the weights in the tree (i.e. the sum stored at the root) and generate a positive random number lower than this. Then you run the search: descend to the left child iff this number (the quantile you are searching for) is less than the left subtree's sum; if, after that, it falls within the current node's own weight, stop at that node; otherwise subtract the left sum and the node's weight from the quantile and descend to the right.
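Here is a sketch of that search over the same hypothetical Node type:

    // Select the node matching quantile q, 0 <= q < root->subtree_sum.
    Node* sample(Node* n, double q) {
        while (n != nullptr) {
            double left_sum = n->left ? n->left->subtree_sum : 0.0;
            if (q < left_sum)
                n = n->left;                // quantile is in the left subtree
            else if (q < left_sum + n->weight)
                return n;                   // quantile lands on this node's item
            else {
                q -= left_sum + n->weight;  // skip left subtree and this node
                n = n->right;
            }
        }
        return nullptr; // unreachable for a valid quantile
    }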

This description is a little chaotic, but I hope it helps.

jpalecek
Updating the sum of weights may get into difficulty if using inexact representations, such as floating point.
Pete Kirkham
This is not a big problem - the worst you can get is slightly inaccurate sampling probabilities.
jpalecek
A: 

I like jpalecek's answer. I would add that the simplest way to get your random number would be (1) Generate u from a uniform(0,1) distribution. (2) Multiply u by the sum of all the weights in the tree.
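A minimal sketch of that draw with <random> (the function name is mine):

    #include <random>

    // Scale a uniform(0,1) draw by the total weight stored at the root.
    double draw_quantile(std::mt19937& gen, double total_weight) {
        std::uniform_real_distribution<double> u(0.0, 1.0);
        return u(gen) * total_weight;
    }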

leif
A: 

Since N is fixed, you can solve this with a cumulative array, say v, where v[0] = 0, v[1] = weight[0] and in general v[i+1] = v[i] + weight[i]. To sample, draw a random number uniformly distributed between 0 and the sum of the weights (v[N]), then binary-search v for it, which is O(log N).
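A sketch of this approach with the standard library (the function names are mine):

    #include <algorithm>
    #include <numeric>
    #include <random>
    #include <vector>

    // Build once: v[0] = 0, v[i+1] = v[i] + weight[i].
    std::vector<double> build_cumulative(const std::vector<double>& weight) {
        std::vector<double> v(weight.size() + 1, 0.0);
        std::partial_sum(weight.begin(), weight.end(), v.begin() + 1);
        return v;
    }

    // O(log N): find the first cumulative sum greater than the draw.
    std::size_t sample_index(const std::vector<double>& v, std::mt19937& gen) {
        std::uniform_real_distribution<double> u(0.0, v.back());
        double q = u(gen);
        return static_cast<std::size_t>(
            std::upper_bound(v.begin(), v.end(), q) - v.begin()) - 1;
    }

(upper_bound rather than lower_bound, so that zero-weight items can never be selected.)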

Naive update of K items is O(KN); a more thoughtful one - rebuilding the cumulative array once after all K changes - is O(N).

Spoiling yet another interview question by the circle of smug :)

obecalp
+1  A: 

This is a problem I had to solve for some Monte Carlo simulations. You can see my current 'binary tree' at the link below. I have tried to make it fairly STL-like. My tree has a fixed capacity, which you could get around with a red-black tree approach, which I have considered trying. It is essentially an implementation of the ideas described by jpalecek.

The method is very robust in coping with inexact floating point numbers, because it almost always sums and compares quantities of the same magnitude: they sit at the same level of the tree.

http://mopssuite.svn.sourceforge.net/viewvc/mopssuite/utils/trunk/include/binary_tree.hpp?view=markup
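As a rough illustration of the same idea (not the code behind the link; the class below is a minimal sketch of a fixed-capacity complete binary tree stored in an array, with names of my own):

    #include <cstddef>
    #include <vector>

    class WeightTree {
    public:
        explicit WeightTree(std::size_t n) : cap_(1) {
            while (cap_ < n) cap_ *= 2;  // pad capacity to a power of two
            tree_.assign(2 * cap_, 0.0); // leaves live at [cap_, 2*cap_)
        }
        void update(std::size_t i, double w) {  // O(log N)
            i += cap_;
            tree_[i] = w;
            for (i /= 2; i >= 1; i /= 2)
                tree_[i] = tree_[2 * i] + tree_[2 * i + 1];
        }
        std::size_t sample(double q) const {    // O(log N), 0 <= q < total()
            std::size_t i = 1;
            while (i < cap_) {
                if (q < tree_[2 * i]) i = 2 * i;            // descend left
                else { q -= tree_[2 * i]; i = 2 * i + 1; }  // descend right
            }
            return i - cap_; // leaf offset = item index
        }
        double total() const { return tree_[1]; }
    private:
        std::size_t cap_;
        std::vector<double> tree_;
    };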

riap