Recently I needed to do weighted random selection of elements from a list, both with and without replacement. While there are well known and good algorithms for unweighted selection, and some for weighted selection without replacement (such as modifications of the reservoir algorithm), I couldn't find any good algorithms for weighted selection with replacement. I also wanted to avoid the reservoir method, as I was selecting a significant fraction of the list, which is small enough to hold in memory.

Does anyone have any suggestions on the best approach in this situation? I have my own solutions, but I'm hoping to find something more efficient, simpler, or both.

+1  A: 

I'd recommend you start by looking at section 3.4.2 of Donald Knuth's Seminumerical Algorithms.

If your arrays are large, there are more efficient algorithms in chapter 3 of Principles of Random Variate Generation by John Dagpunar. If your arrays are not terribly large or you're not concerned with squeezing out as much efficiency as possible, the simpler algorithms in Knuth are probably fine.

John D. Cook
I just took a look at section 3.4.2, and it covers only unbiased selection with and without replacement - there's no mention made of weighted selection.
Nick Johnson
+1  A: 

Here's what I came up with for weighted selection without replacement:

import random

def WeightedSelectionWithoutReplacement(l, n):
  """Selects without replacement n random elements from a list of (weight, item) tuples."""
  # Give each element a random key scaled by its weight, then keep the n largest keys.
  l = sorted((random.random() * weight, item) for weight, item in l)
  return [item for _key, item in l[-n:]]

This is O(m log m), where m is the number of items in the list being selected from. I'm fairly certain this will weight items correctly, though I haven't verified it in any formal sense.

Here's what I came up with for weighted selection with replacement:

import bisect
import random

def WeightedSelectionWithReplacement(l, n):
  """Selects with replacement n random elements from a list of (weight, item) tuples."""
  cum_weights = []
  items = []
  total_weight = 0.0
  for weight, item in l:
    total_weight += weight
    cum_weights.append(total_weight)
    items.append(item)
  # Draw n uniform variates and map each into the cumulative distribution.
  return [items[bisect.bisect(cum_weights, random.random() * total_weight)]
          for _ in range(n)]

This is O(m + n log m), where m is the number of items in the input list, and n is the number of items to be selected.
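
For example, a rough empirical check of the distribution (the weights and sample size here are arbitrary, chosen only for illustration):

from collections import Counter

# With weights 1:2:3 the observed counts should come out roughly in a 1:2:3 ratio.
sample = WeightedSelectionWithReplacement([(1, 'a'), (2, 'b'), (3, 'c')], 60000)
print(Counter(sample))  # roughly 10000 'a', 20000 'b', 30000 'c'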

Nick Johnson
That first function is brilliant, but alas it doesn't weight the items correctly. Consider `WeightedSelectionWithoutReplacement([(1, 'A'), (2, 'B')], 1)`. It will choose A with probability 1/4, not 1/3. Hard to fix.
Jason Orendorff
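A quick simulation sketch of that 1/4 figure (the trial count is arbitrary):

import random

# A is selected only when its key random()*1 exceeds B's key random()*2,
# i.e. when u1 > 2*u2, which happens with probability 1/4 rather than the desired 1/3.
trials = 100000
wins = sum(random.random() > 2 * random.random() for _ in range(trials))
print(wins / float(trials))  # prints approximately 0.25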
Btw, faster but more complex algorithms are in my answer here: http://stackoverflow.com/questions/2140787/select-random-k-elements-from-a-list-whose-elements-have-weights/2149533#2149533
Jason Orendorff
+2  A: 

One of the fastest ways to draw many samples with replacement from an unchanging list is the alias method. The core intuition is that we can create a set of equal-sized bins for the weighted list that can be indexed very efficiently through bit operations, avoiding a binary search. It turns out that, done correctly, we only need to store two items from the original list per bin, so we can represent the split with a single percentage.

Let's take the example of five equally weighted choices, (a:1, b:1, c:1, d:1, e:1).

To create the alias lookup:

  1. Normalize the weights such that they sum to 1. (a:0.2, b:0.2, c:0.2, d:0.2, e:0.2) This is the probability of choosing each variable.

  2. Find the smallest power of 2 greater than or equal to the number of variables, and create this number of partitions, |p|. Each partition represents a probability mass of 1/|p|. In this case, we create 8 partitions, each able to contain 0.125.

  3. Take the variable with the least remaining weight, and place as much of its mass as possible in an empty partition. In this example, we see that 'a' fills the first partition. (p1{a|null,1.0},p2,p3,p4,p5,p6,p7,p8) with (a:0.075, b:0.2, c:0.2, d:0.2, e:0.2)

  4. If the partition is not filled, take the variable with the most weight, and fill the partition with that variable.

Repeat steps 3 and 4 until all of the weight from the original list has been assigned to a partition.

For example, if we run another iteration of 3 and 4, we see

(p1{a|null,1.0},p2{a|b,0.6},p3,p4,p5,p6,p7,p8) with (a:0, b:0.15, c:0.2, d:0.2, e:0.2) left to be assigned

At runtime:

  1. Get a U(0,1) random number, say binary 0.001100000

  2. Bitshift it left by lg2(|p|) bits to find the partition index. Here we shift by three, yielding 001.1, i.e. position 1, and thus partition 2.

  3. If the partition is split, use the fractional portion of the shifted random number to decide the split. In this case, the value is 0.5, and 0.5 < 0.6, so return a.

Here is some code and another explanation, but unfortunately it doesn't use the bitshifting technique, nor have I actually verified it.
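
As a rough illustration of the construction and the bitshift lookup, here is a sketch in Python (the function names are invented for this example, ties between equally weighted variables are broken arbitrarily, and the multiply-and-truncate in the lookup is just the arithmetic equivalent of the bitshift described above):

import random

def build_alias_table(weighted):
  """Builds an alias table from a list of (weight, item) tuples.

  Returns (shift, partitions), where partitions[i] = (primary, alias, split):
  a fractional value below split selects primary, otherwise alias.
  """
  total = float(sum(w for w, _ in weighted))
  remaining = [(w / total, item) for w, item in weighted]  # step 1: normalize
  p = 1                                                    # step 2: power-of-two partition count
  while p < len(remaining):
    p *= 2
  cap = 1.0 / p
  partitions = []
  while remaining and len(partitions) < p:
    remaining.sort(key=lambda pair: pair[0])               # least remaining weight first
    small_w, small_item = remaining[0]
    if len(remaining) == 1 or small_w >= cap - 1e-9:
      # Step 3: the smallest variable fills the partition on its own.
      partitions.append((small_item, small_item, 1.0))
      remaining[0] = (small_w - cap, small_item)
    else:
      # Step 4: top up the partition with the variable holding the most weight.
      large_w, large_item = remaining.pop()
      partitions.append((small_item, large_item, small_w / cap))
      remaining.pop(0)
      remaining.append((large_w - (cap - small_w), large_item))
    remaining = [(w, item) for w, item in remaining if w > 1e-9]
  while len(partitions) < p:  # guard against floating point drift
    partitions.append((partitions[-1][1], partitions[-1][1], 1.0))
  return p.bit_length() - 1, partitions

def alias_sample(shift, partitions):
  """Draws one item: the high bits of a U(0,1) variate pick the partition,
  and the leftover fraction decides between that partition's two variables."""
  u = random.random() * (1 << shift)  # equivalent to shifting the binary fraction left
  index = int(u)
  primary, alias, split = partitions[index]
  return primary if (u - index) < split else alias

# The five equally weighted choices from the walkthrough above.
shift, table = build_alias_table([(1, 'a'), (1, 'b'), (1, 'c'), (1, 'd'), (1, 'e')])
print([alias_sample(shift, table) for _ in range(10)])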

John the Statistician
Sorry to say your explanation here is a bit unclear, but the linked page describes it in more detail - thanks! It certainly is a rather cool algorithm, and seems like it fits the bill. :)
Nick Johnson
Yes, I was at work, and I wanted to hammer it out. I'll expand on this in the near future. Glad you found it helpful, even so. :)
John the Statistician
Much better, thanks!
Nick Johnson
@John, after running another iteration of 3 and 4, isn't 0 left in a, since you assigned it all to p2?
Eli Bendersky
@Eli, right you are, now hopefully fixed.
John the Statistician