I have a sorted set (std::set, to be precise) whose elements each have an assigned weight. I want to randomly choose N elements from this set, where elements with higher weight have a greater probability of being chosen. Any element can be chosen multiple times.

I want to do this as efficiently as possible: I want to avoid copying the set (it might get very large) and, if possible, run in O(N) time. I'm using C++ and would like to stick to an STL + Boost only solution.

Does anybody know if there is a function in STL/Boost that performs this task? If not, how to implement one?

+1  A: 

I don't know about any libraries, but it sounds like you have a weighted roulette wheel. Here's a reference with some pseudo-code, although the context is related to genetic algorithms: http://www.cse.unr.edu/~banerjee/selection.htm

As for "as efficiently as possible," that depends on the characteristics of your data. When searching for the winning index on the roulette wheel, you could use a binary search over the cumulative weights instead of a linear scan. On the other hand, the slots are not equally likely, so a linear scan that examines them in decreasing order of weight can also work well.

Michael McGowan
+3  A: 

You need to calculate (and possibly cache, if you care about performance) the sum of all weights in your set. Then generate N random numbers between 0 and this sum. Finally, iterate over your set, keeping a running sum of the weights encountered so far. For each element, inspect all the remaining random numbers: if a number falls between the previous and the current value of the running sum, output that element and remove the random number. Stop when your list of random numbers is empty or you've reached the end of the set.

Alex Emelianov
Thanks, this seems to perform OK in my case and works well.
dark_charlie
For optimal performance, consider placing the random values in an ordered collection and iterating over it once, instead of re-scanning it for every element of the source set. You don't have to remove values from the random collection; just advance the iterator.
Alex Emelianov
+1  A: 

A lot depends on the amount of extra storage you're willing to expend to make the selection faster.

If you're not willing to use any extra storage, @Alex Emelianov's answer is pretty much what I was thinking of posting. If you're willing to use some extra storage (and possibly a different data structure than std::set), you could build a tree (like the one a set uses) in which each node also stores the total weight of the items to its left. This lets you map from a generated number to the associated value with logarithmic (rather than linear) complexity.

Jerry Coffin
Even though your algorithm is probably faster, I used Alex's answer as it doesn't seem to be a performance bottleneck and it was easier to implement :) Thanks for your answer.
dark_charlie