Selecting without any weights (equal probabilities) is beautifully described here.

I was wondering if there is a way to convert this approach to a weighted one.

I am interested in other approaches as well.

Update: Sampling without replacement

A: 

I used an associative map (weight, object). For example:

{
(10,"low"),
(100,"mid"),
(10000,"large")
}

total=10110

Pick a random number between 0 and 'total' and iterate over the keys until this number falls in the corresponding range.
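A minimal sketch of that linear scan in Python (the list and names are illustrative):

import random

items = [(10, "low"), (100, "mid"), (10000, "large")]

def pick_weighted(items):
    # Draw a point in [0, total) and walk the (weight, object) pairs
    # until the running weight covers it. One O(m) pass per selection.
    total = sum(w for w, v in items)   # total = 10110 for the list above
    x = random.uniform(0, total)
    for w, v in items:
        if x < w:
            return v
        x -= w
    return v  # floating-point edge case: fall back to the last entry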

Pierre
The way I see it, the question is about selecting multiple items in a single pass (see the linked question). Your approach requires a pass for each selection.
Oak
A: 

In the question you linked to, Kyle's solution would work with a trivial generalization. Scan the list and sum the total weights. Then the probability of choosing an element should be:

1 - (1 - (#needed / (weight left))) / (weight at n)

After visiting a node, subtract its weight from the total. Also, if you need n and have n left, you have to stop explicitly.

You can check that when everything has weight 1, this simplifies to Kyle's solution.

Edited: (I had to rethink what "twice as likely" meant.)

Kyle Butt
Let's say I have 4 elements in the list with weights {2, 1, 1, 1}, and I am going to choose 3 from this list. According to your formula, for the first element 3 * 2/5 = 1.2, which is > 1. What am I missing here?
nimcap
It's been corrected above, thanks.
Kyle Butt
Now with {2, 1, 1, 1} choose 3, the probability of picking the first element is 1 - (1 - (3/5))/2 = 1 - (2/5)/2 = 1 - 1/5 = 4/5, as expected.
Kyle Butt
I believe there is something wrong with your formula; I don't have time to write it up now, but I will when I do. If you apply the formula to the first two elements in different orders, you will see that it doesn't produce the same results. It should give the same results regardless of the order.
nimcap
Say there are 4 elements with weights {7, 1, 1, 1} and we are going to pick 2. Let's calculate the chance of picking the first 2 elements: P(1st) * P(2nd) = (1 - (1 - 2/10)/7) * (1 - (1 - 1/3)) = 31/105. Now change the list to {1, 7, 1, 1}; the probability of picking the first 2 elements should remain the same: P(1st) * P(2nd) = (1 - (1 - 2/10)) * (1 - (1 - 1/9)/7) = 11/63. They are not the same.
nimcap
Even simpler, say there are 2 elements with weights {1/2, 1/2} and we are going to pick 2. The probability should be 1; we have to take both. But the formula gives 1 - (1 - (2/1))/(1/2) = 3.
Jason Orendorff
Unfortunately I don't see an easy way to simplify calculating the real probability, which is P(a, n) = if n == 0 then 0 else 1 - sum i=1..|a| of (weight(a[i]) / total_weight(a) * (1 - P(a omitting a[i], n - 1))).
Jason Orendorff
+8  A: 

If the sampling is with replacement, you can use this algorithm (implemented here in Python):

import random

items = [(10, "low"),
         (100, "mid"),
         (890, "large")]

def weighted_sample(items, n):
    total = float(sum(w for w, v in items))
    i = 0
    w, v = items[0]
    while n:
        x = total * (1 - random.random() ** (1.0 / n))
        total -= x
        while x > w:
            x -= w
            i += 1
            w, v = items[i]
        w -= x
        yield v
        n -= 1

This is O(n + m) where m is the number of items.
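For example, a quick illustrative run with the `items` list above (output varies from run to run; since this samples with replacement, repeats are possible):

>>> list(weighted_sample(items, 5))
['mid', 'large', 'large', 'large', 'large']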

Why does this work? It is based on the following algorithm:

def n_random_numbers_decreasing(v, n):
    """Like reversed(sorted(v * random() for i in range(n))),
    but faster because we avoid sorting."""
    while n:
        v *= random.random() ** (1.0 / n)
        yield v
        n -= 1

The function weighted_sample is just this algorithm fused with a walk of the items list to pick out the items selected by those random numbers.

This in turn works because the probability that n random numbers 0..v will all happen to be less than z is P = (z/v)^n. Solve for z, and you get z = v * P^(1/n). Substituting a random number for P picks the largest number with the correct distribution; and we can just repeat the process to select all the other numbers.
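As a quick sanity check of that derivation (illustrative, not part of the original answer): the first number yielded is v * random()^(1/n), which should be distributed like the largest of n uniform draws on [0, v), whose mean is v * n/(n + 1).

import random

trials = 200000
n, v = 3, 1.0
mean = sum(v * random.random() ** (1.0 / n) for _ in range(trials)) / trials
print(mean)  # should come out close to v * n / (n + 1) = 0.75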

If the sampling is without replacement, you can put all the items into a binary heap, where each node caches the total of the weights of all items in that subheap. Building the heap is O(m). Selecting a random item from the heap, respecting the weights, is O(log m). Removing that item and updating the cached totals is also O(log m). So you can pick n items in O(m + n log m) time.

Here's an implementation of that, plentifully commented:

import random

class Node:
    # Each node in the heap has a weight, value, and total weight.
    # The total weight, self.tw, is self.w plus the weight of any children.
    __slots__ = ['w', 'v', 'tw']
    def __init__(self, w, v, tw):
        self.w, self.v, self.tw = w, v, tw

def rws_heap(items):
    # h is the heap. It's like a binary tree that lives in an array.
    # It has a Node for each pair in `items`. h[1] is the root. Each
    # other Node h[i] has a parent at h[i>>1]. Each node has up to 2
    # children, h[i<<1] and h[(i<<1)+1].  To get this nice simple
    # arithmetic, we have to leave h[0] vacant.
    h = [None]                          # leave h[0] vacant
    for w, v in items:
        h.append(Node(w, v, w))
    for i in range(len(h) - 1, 1, -1):  # total up the tws
        h[i>>1].tw += h[i].tw           # add h[i]'s total to its parent
    return h

def rws_heap_pop(h):
    gas = h[1].tw * random.random()     # start with a random amount of gas

    i = 1                     # start driving at the root
    while gas > h[i].w:       # while we have enough gas to get past node i:
        gas -= h[i].w         #   drive past node i
        i <<= 1               #   move to first child
        if gas > h[i].tw:     #   if we have enough gas:
            gas -= h[i].tw    #     drive past first child and descendants
            i += 1            #     move to second child
    w = h[i].w                # out of gas! h[i] is the selected node.
    v = h[i].v

    h[i].w = 0                # make sure this node isn't chosen again
    while i:                  # fix up total weights
        h[i].tw -= w
        i >>= 1
    return v

def random_weighted_sample_no_replacement(items, n):
    heap = rws_heap(items)              # just make a heap...
    for i in range(n):
        yield rws_heap_pop(heap)        # and pop n items off it.
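For illustration, reusing the same kind of `items` list as in the first snippet (output varies, and each item can be selected at most once):

>>> items = [(10, "low"), (100, "mid"), (890, "large")]
>>> list(random_weighted_sample_no_replacement(items, 2))
['large', 'mid']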
Jason Orendorff
+1 for awesome use of Python control structure variations that I haven't seen before
Sparr
See my answer to another question for a Python implementation of the binary-tree method: http://stackoverflow.com/questions/526255/probability-distribution-in-python/526843#526843
Darius Bacon
Darius Bacon: Upvoted! While I was fiddling with this I found that you don't need a tree. It can be done with a heap. So I added an implementation of my own to this answer.
Jason Orendorff
Nice idea with the heap! I found the *name* 'heap' misleading for what you did, though, since that's usually taken to mean a structure that obeys the heap invariant on the weights. Here the heap invariant does apply, but to the .tw field and it's incidental to how your code works, if I understand it right.
Darius Bacon
BTW it looks like you could avoid leaving a 0 in the array by moving the array's last entry into its place, updating the .tw's back up to the root from the old and new positions. This doesn't appear worth it for this static problem, but it would for a dynamic one like my post was about. I'll update that post to point here.
Darius Bacon
Yes, I agree with both those comments. (I couldn't think of a better name for it than "heap", though. I guess it might be better to call it a binary tree and treat the array as an implementation detail. But that's confusing too!)
Jason Orendorff
There's a standard structure something like your heap known as the Fenwick tree: http://en.wikipedia.org/wiki/Fenwick_tree -- from a quick skim it doesn't sound the same, but I didn't really look into it.
Darius Bacon
+11  A: 

If the sampling is with replacement, use the roulette-wheel selection technique (often used in genetic algorithms):

  1. sort the weights
  2. compute the cumulative weights
  3. pick a random number in [0,1]*totalWeight
  4. find the interval this number falls into
  5. select the element corresponding to that interval
  6. repeat k times

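A minimal sketch of those steps in Python (using the standard `bisect` module for step 4; names are illustrative). Step 1 is omitted because, as noted in the comments below, sorting isn't required for correctness:

import bisect
import random

def roulette_wheel_select(items, k):
    # items: list of (weight, value) pairs. Returns k values, with replacement.
    cumulative = []                             # step 2: cumulative weights
    total = 0
    for weight, value in items:
        total += weight
        cumulative.append(total)
    picks = []
    for _ in range(k):                          # step 6: repeat k times
        r = random.random() * total             # step 3: random point in [0, total)
        i = bisect.bisect_right(cumulative, r)  # step 4: find the interval
        picks.append(items[i][1])               # step 5: take that element
    return picks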

If the sampling is without replacement, you can adapt the above technique by removing the selected element from the list after each iteration, then re-normalizing the weights so that their sum is 1 (a valid probability distribution), as sketched below.
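A sketch of that adaptation (it works on a copy so the input list is left untouched; re-normalization is implicit because each draw is taken from the current total):

import random

def roulette_wheel_sample_no_replacement(items, k):
    pool = list(items)              # copy, so the input list is not modified
    picks = []
    for _ in range(k):
        total = sum(w for w, v in pool)
        r = random.random() * total
        for i, (w, v) in enumerate(pool):
            if r < w:
                break
            r -= w
        picks.append(v)
        del pool[i]                 # remove the selected element
    return picks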

Amro
+1, this wins big on clarity. But note that the roulette-wheel algorithm takes O(*n* log *m* + *m*) time, where *n* is the number of samples and *m* is the number of items. That's if you omit the sorting, which is unnecessary, and do a binary search in step 4. Also, it requires O(*m*) space for the cumulative weights. In my answer there's a 14-line function that does the same thing in O(*n* + *m*) time and O(1) space.
Jason Orendorff
If I have to remove the selected element, I need to copy the whole list (I am assuming we are not allowed to modify the input list), which is expensive.
nimcap