Does anyone know of an algorithm or data structure for selecting items with a probability proportional to some attached value? In other words: http://en.wikipedia.org/wiki/Sampling_%28statistics%29#Probability_proportional_to_size_sampling

The context here is a decentralized reputation system, and the attached value is the amount of trust one user has in another. In this system, all nodes start either as friends, which are completely trusted, or as unknowns, which are completely untrusted. That isn't useful by itself in a large P2P network, because there will be many more nodes than you have friends, and you need to know who to trust among the large group of users that aren't your direct friends. So I've implemented a dynamic trust system in which unknowns can gain trust via friend-of-a-friend relationships.

Every so often, each user will select a fixed number of target nodes (fixed for the sake of speed and bandwidth) and recalculate their trust based on how much a second fixed-size set of intermediate nodes trusts them. The probability of selecting a target node for recalculation is inversely proportional to its current trust, so that unknowns have a good chance of becoming better known. The intermediate nodes are selected in the same way, except that the probability of selecting an intermediary is directly proportional to its current trust.
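For concreteness, one way such weights could be derived (this assumes trust values normalized to [0, 1], and the small epsilon is just to keep every node selectable; both are my own choices):

#include <cstddef>
#include <vector>

// Illustrative sketch only. Trust values assumed in [0, 1].
// Targets get weight inversely related to trust; intermediaries
// get weight equal to trust. eps keeps the weight nonzero so
// fully trusted/untrusted nodes can still be drawn.
std::vector<double> target_weights(const std::vector<double> &trust) {
    const double eps = 1e-6;
    std::vector<double> w(trust.size());
    for (std::size_t i = 0; i < trust.size(); ++i)
        w[i] = (1.0 - trust[i]) + eps;
    return w;
}

std::vector<double> intermediary_weights(const std::vector<double> &trust) {
    const double eps = 1e-6;
    std::vector<double> w(trust.size());
    for (std::size_t i = 0; i < trust.size(); ++i)
        w[i] = trust[i] + eps;
    return w;
}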

I've written up a simple solution myself, but it is rather slow, and I'd like to find a C++ library to handle this aspect for me. I have of course done my own search, and I managed to find TRSL, which I'm digging through right now. Since this seems like a fairly simple and perhaps common problem, I would expect there to be many more C++ libraries I could use, so I'm asking this question in the hope that someone here can shed some light on it.

+1  A: 

This is what I'd do:

#include <cstdlib>  // rand(), RAND_MAX

int select(const double *weights, int n) {
    // This step is only necessary if the weights can be arbitrary
    // (we know total == 1.0 if they are probabilities).
    double total = 0;
    for (int i = 0; i < n; ++i) {
        total += weights[i];
    }

    // Cast RAND_MAX to double to avoid integer overflow;
    // r is uniform in [0, total).
    double r = (double) rand() * total / ((double) RAND_MAX + 1);
    total = 0;
    for (int i = 0; i < n; ++i) {
        // Guaranteed to fire before loop exit, since r < sum of weights
        if (total <= r && total + weights[i] > r) {
            return i;
        }

        total += weights[i];
    }

    // Not reached in theory; guard against floating-point rounding
    // (and keep the compiler happy about a missing return).
    return n - 1;
}

You can of course repeat the second loop as many times as you want, choosing a new r each time, to generate multiple samples.
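For instance, a quick driver for the snippet above (the weights are made up) that sanity-checks the distribution:

#include <cstdio>
#include <cstdlib>
#include <ctime>

int main() {
    // Expect counts roughly proportional to 1 : 3 : 6.
    double weights[] = {1.0, 3.0, 6.0};
    int counts[3] = {0, 0, 0};

    srand((unsigned) time(0));
    for (int k = 0; k < 10000; ++k) {
        ++counts[select(weights, 3)];
    }
    printf("%d %d %d\n", counts[0], counts[1], counts[2]);
    return 0;
}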

j_random_hacker
If all the OP is after is a weighted select, then the random library from TR1 already provides the means to create discrete distributions efficiently.
dman
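For reference, a minimal sketch of that approach; note that `std::discrete_distribution` itself was only standardized in C++11's `<random>` (which grew out of the TR1 random library), and the weights here are made up:

#include <iostream>
#include <random>
#include <vector>

int main() {
    // Illustrative trust weights; index i is drawn with
    // probability weights[i] / sum(weights).
    std::vector<double> weights = {0.1, 0.5, 0.2, 0.2};

    std::mt19937 gen(std::random_device{}());
    std::discrete_distribution<int> dist(weights.begin(), weights.end());

    for (int k = 0; k < 5; ++k) {
        std::cout << dist(gen) << '\n';
    }
    return 0;
}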
For your first loop to exit you need `unsigned int n`. And since your second loop seems to calculate something like a cumulative probability distribution, shouldn't you properly normalize `sum(weights)` and `return round(i)`?
honk
@darid: I didn't know about that library; you should mention it in a separate answer!
j_random_hacker
@honk: ??? The "correct" type for `n` and `i` is `size_t`, which may or may not be `unsigned int`. I just used `int` to keep things informal. Your 2nd sentence is just as puzzling -- all the 2nd loop does is decide which "bucket" a uniformly-distributed random number falls into. `i` is an integer, so what's the sense in rounding it?
j_random_hacker
Yes, I got confused on (2): you just return the index. For (1), if somebody blindly copies this, they'll have a bug.
honk
No, it's perfectly reasonable to state as a precondition of this subroutine that it will work only with arrays up to this size. You might as well criticise me for using `rand()` at all, since it's not threadsafe. These things are *reasonable caveats* for a code snippet that's just designed to get the message across without complicating things overly.
j_random_hacker
@j_random_hacker: Thanks for that. However I was hoping for a faster (likely tree-based) algorithm in which I could place all the nodes for selection into a data structure and select from it multiple times efficiently.
Stephen Cross
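One structure that fits this bill (a sketch of my own, not something proposed in the thread) is a Fenwick/binary-indexed tree over the weights: each draw and each weight update then costs O(log n):

#include <cstdlib>
#include <vector>

// Sketch: weighted sampling with O(log n) draws and O(log n)
// weight updates, via a Fenwick (binary-indexed) tree of weights.
class WeightedPicker {
    std::vector<double> tree;  // 1-based Fenwick array

public:
    explicit WeightedPicker(int n) : tree(n + 1, 0.0) {}

    // Add delta to the weight of item i (0-based).
    void update(int i, double delta) {
        for (++i; i < (int) tree.size(); i += i & -i)
            tree[i] += delta;
    }

    // Sum of all weights.
    double total() const {
        double t = 0;
        for (int i = (int) tree.size() - 1; i > 0; i -= i & -i)
            t += tree[i];
        return t;
    }

    // Index of the bucket that a value r in [0, total()) falls into.
    int find(double r) const {
        int pos = 0, mask = 1;
        while (mask * 2 < (int) tree.size()) mask *= 2;
        for (; mask > 0; mask /= 2) {
            int next = pos + mask;
            if (next < (int) tree.size() && tree[next] <= r) {
                r -= tree[next];
                pos = next;
            }
        }
        return pos;  // 0-based item index
    }

    // Draw one index with probability proportional to its weight.
    int sample() const {
        double r = (double) rand() * total() / ((double) RAND_MAX + 1);
        return find(r);
    }
};

Call update() whenever a node's trust changes; for sampling without replacement, draw an index and temporarily zero its weight with update().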
@darid: Thanks for the library.
Stephen Cross
@Stephen: You can do that efficiently by generating a vector of `r` values and then sorting them in increasing order (O(m log m) time for a vector of size m). Then, the 2nd loop starts scanning through looking for what bucket the first `r` value lands in. When it finds it, add that index to the vector of samples found, and start looking for the next one from that same position. (Remember to back up `i` by 1 to allow for multiple samples hitting the same bucket.)
j_random_hacker
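A rough sketch of that idea (my code, following the comment above): rather than backing `i` up, it simply never advances past the current bucket until the next `r` requires it:

#include <algorithm>
#include <cstdlib>
#include <vector>

// Draw m samples in one scan: sort the m uniform r values,
// then walk the cumulative-weight buckets left to right.
// O(n + m log m) overall.
std::vector<int> select_many(const double *weights, int n, int m) {
    double total = 0;
    for (int i = 0; i < n; ++i) total += weights[i];

    std::vector<double> rs(m);
    for (int j = 0; j < m; ++j)
        rs[j] = (double) rand() * total / ((double) RAND_MAX + 1);
    std::sort(rs.begin(), rs.end());

    std::vector<int> samples;
    samples.reserve(m);
    double cum = 0;
    int i = 0;
    for (int j = 0; j < m; ++j) {
        // Advance to the bucket containing rs[j]; rs is sorted,
        // so i never moves backwards.
        while (i < n - 1 && cum + weights[i] <= rs[j]) {
            cum += weights[i];
            ++i;
        }
        samples.push_back(i);  // several r's may land in one bucket
    }
    return samples;
}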
@j_random_hacker: That sounds like a great idea, thanks for your help.
Stephen Cross
No problem :) You can +1 my answer if you want (hint hint... :-P)
j_random_hacker
Once I have the necessary reputation, I will :)
Stephen Cross
and...done :) I'll mark your answer as the accepted answer if no better answer appears in the next few days
Stephen Cross