ansaurus

Question

Answer 1

+1 A:

It the values are not too large, you can do it this way

>>> from random import choice
>>> d={"key1":2,"key2":1}
>>> c=[]
>>> for k,v in d.items():
...  c+=[k]*v
... 
>>> choice(c)
'key1'
>>> sum(1 for x in range(100) if choice(c)=="key1")
63
>>> sum(1 for x in range(100) if choice(c)=="key2")
36

gnibbler 2010-02-21 10:24:24

Answer 2

+2 A:

If the values are too large for gnibler's approach:

Build a list of tuples (key, index), where index is the sum of all values that come before key in the list (this would be the index of the first occurrence of key gnibler's list c. Also calculate the sum of all values (n).

Now, generate a random number xbetween 0 and n - 1. Find the last entry in the list with index < x. Since the list is sorted by index, you can use binary search to do that efficiently.

Update: KennyTM's code is an implementation of this, except that he uses a brute-force linear search instead of binary search; this will be inefficient if the number of keys are large.

oefe 2010-02-21 10:36:44

+1. This is known as Roulette Wheel Selection or Fitness Proportional Selection, and is commonly used in genetic algorithms.

Dave Kirby 2010-02-21 10:39:38

Answer 3

+1 A:

1. Construct a CDF-like list like this:

def build_cdf(distrib):
    cdf = []
    val = 0
    for key, freq in distrib.items():
        val += freq
        cdf.append((val, key))
    return (val, cdf)

This function returns a tuple, the 1st value is the sum of probabilities, and 2nd value is the CDF.

2. Construct the sampler like this:

import random
def sample_from_cdf(val_and_cdf):
    (val, cdf) = val_and_cdf;
    rand = random.uniform(0, val)
    # use bisect.bisect_left to reduce search time from O(n) to O(log n).
    return [key for index, key in cdf if index > rand][0]

Usage:

x = build_cdf({"a":0.2, "b":0.3, "c":0.5});
y = [sample_from_cdf(x) for i in range(0,100000)];
print (len([t for t in y if t == "a"]))   # 19864
print (len([t for t in y if t == "b"]))   # 29760
print (len([t for t in y if t == "c"]))   # 50376

You may want to make this into a class.

KennyTM 2010-02-21 10:37:44

Answer 4

A:

A quick and simple version of the algorithm from oefe's and KennyTM's answers:

def select_weighted(d):
   offset = random.randint(0, sum(d.itervalues())-1)
   for k, v in d.iteritems():
      if offset < v:
         return k
      offset -= v

sth 2010-02-21 11:13:03

ansaurus

tags:

views:

answers:

Sampling keys due to their values

related questions