views:

47

answers:

4

Hi guys, I have a dictionary(python) with key->value (str->int). I have to chose some key due to it's own value. Then bigger this value the key has less posibility to be chosen. For example if key1 has value 2 and key2->1 key1 the attitude should be 2:1.

How can I do this?

+1  A: 

It the values are not too large, you can do it this way

>>> from random import choice
>>> d={"key1":2,"key2":1}
>>> c=[]
>>> for k,v in d.items():
...  c+=[k]*v
... 
>>> choice(c)
'key1'
>>> sum(1 for x in range(100) if choice(c)=="key1")
63
>>> sum(1 for x in range(100) if choice(c)=="key2")
36
gnibbler
+2  A: 

If the values are too large for gnibler's approach:

Build a list of tuples (key, index), where index is the sum of all values that come before key in the list (this would be the index of the first occurrence of key gnibler's list c. Also calculate the sum of all values (n).

Now, generate a random number xbetween 0 and n - 1. Find the last entry in the list with index < x. Since the list is sorted by index, you can use binary search to do that efficiently.

Update: KennyTM's code is an implementation of this, except that he uses a brute-force linear search instead of binary search; this will be inefficient if the number of keys are large.

oefe
+1. This is known as Roulette Wheel Selection or Fitness Proportional Selection, and is commonly used in genetic algorithms.
Dave Kirby
+1  A: 

1. Construct a CDF-like list like this:

def build_cdf(distrib):
    cdf = []
    val = 0
    for key, freq in distrib.items():
        val += freq
        cdf.append((val, key))
    return (val, cdf)

This function returns a tuple, the 1st value is the sum of probabilities, and 2nd value is the CDF.

2. Construct the sampler like this:

import random
def sample_from_cdf(val_and_cdf):
    (val, cdf) = val_and_cdf;
    rand = random.uniform(0, val)
    # use bisect.bisect_left to reduce search time from O(n) to O(log n).
    return [key for index, key in cdf if index > rand][0]

Usage:

x = build_cdf({"a":0.2, "b":0.3, "c":0.5});
y = [sample_from_cdf(x) for i in range(0,100000)];
print (len([t for t in y if t == "a"]))   # 19864
print (len([t for t in y if t == "b"]))   # 29760
print (len([t for t in y if t == "c"]))   # 50376

You may want to make this into a class.

KennyTM
A: 

A quick and simple version of the algorithm from oefe's and KennyTM's answers:

def select_weighted(d):
   offset = random.randint(0, sum(d.itervalues())-1)
   for k, v in d.iteritems():
      if offset < v:
         return k
      offset -= v
sth