views:

170

answers:

6

I wish to select a random word from a list where the is a known chance for each word, for example:

Fruit with Probability

Orange 0.10 Apple 0.05 Mango 0.15 etc

How would be the best way of implementing this? The actual list I will take from is up to 100 items longs and the % do not all tally to 100 % they do fall short to account for the items that had a really low chance of occurrence. I would ideally like to take this from a CSV which is where I store this data. This is not a time critical task.

Thank you for any advice on how best to proceed.

A: 

One solution is to normalize the probabilities to integers and then repeat each element once per value (e.g. a list with 2 Oranges, 1 Apple, 3 Mangos). This is incredibly easy to do (from random import choice). If that is not practical, try the code here.

Brian
A: 
import random
d= {'orange': 0.10, 'mango': 0.15, 'apple': 0.05}
weightedArray = []
for k in d:
  weightedArray+=[k]*int(d[k]*100)
random.choice(weightedArray)

EDITS

This is essentially what Brian said above.

Mark
this is very inefficient in general, and likely inaccurate as soon as the probabilities aren't nice fractions.
Peter
+1  A: 
lst = [ ('Orange', 0.10), ('Apple', 0.05), ('Mango', 0.15), ('etc', 0.69) ]

x = 0.0
lst2 = []
for fruit, chance in lst:
    tup = (x, fruit)
    lst2.append(tup)
    x += chance

tup = (x, None)
lst2.append(tup)

import random

def pick_one(lst2):
    if lst2[0][1] is None:
        raise ValueError, "no valid values to choose"
    while True:
        r = random.random()
        for x, fruit in reversed(lst2):
            if x <= r:
                if fruit is None:
                    break  # try again with a different random value
                else:
                    return fruit

pick_one(lst2)

This builds a new list, with ascending values representing the range of values that choose a fruit; then pick_one() walks backward down the list, looking for a value that is <= the current random value. We put a "sentinel" value on the end of the list; if the values don't reach 1.0, there is a chance of a random value that shouldn't match anything, and it will match the sentinel value and then be rejected. random.random() returns a random value in the range [0.0, 1.0) so it is certain to match something in the list eventually.

The nice thing here is that you should be able to have one value with a 0.000001 chance of matching, and it should actually match with that frequency; the other solutions, where you make a list with the items repeated and just use random.choice() to choose one, would require a list with a million items in it to handle this case.

steveha
It might be helpful to add some comments to describe what this code is doing, and why it works. Essentially, each item is given a "probability range", defined by its value in `lst2[x][0]` as the upper bound, and `lst2[x-1][0]` as its lower bound. Each element in the list is checked against the output from `random.random()`. One question: why the `while True` loop?
dcrosta
The `while True:` loop is essential to handle the case where `fruit is None`. The problem stated that the chances might not add up to 100%, and my sample data only adds up to 0.99, so there is a 1% chance of matching the sentinel value. We break out of the loop over lst2, and try again in the `while True:` loop, getting a new random value.
steveha
Now bearing in mind I am only 2 days into python or any programming (i.e this is the stupid disclaimer) does this start at the end of the list and test whether that fruit is the one to pick and if it isn't move onto the next smallest (or rather testing right from left).
MDA1973
pick_one() looks at the values in lst2, using a `for` loop. The loop will look at the values in the list in order. However, I put `reversed()` around the list, which means the `for` loop will look at the list in reversed order. In my example, 'Apple' has a 0.05 chance of being chosen; the fruit after 'Apple', 'Mango', has a value of 0.15 and 'Apple' itself has a value of 0.10. Thus values less than 0.15 and greater than or equal to 0.10 should match 'Apple': we want 0.10 <= r < 0.15 So by walking backward through the list, comparing values for <=, we get this range matching behavior.
steveha
I posted another answer, somewhat simpler than this one but doing basically the same thing. It saves explicit ranges, rather than making ascending values serve as ranges.
steveha
+2  A: 

You can pick items with weighted probabilities if you assign each item a number range proportional to its probability, pick a random number between zero and the sum of the ranges and find what item matches it. The following class does exactly that:

from random import random

class WeightedChoice(object):
    def __init__(self, weights):
        """Pick items with weighted probabilities.

            weights
                a sequence of tuples of item and it's weight.
        """
        self._total_weight = 0.
        self._item_levels = []
        for item, weight in weights:
            self._total_weight += weight
            self._item_levels.append((self._total_weight, item))

    def pick(self):
        pick = self._total_weight * random()
        for level, item in self._item_levels:
            if level >= pick:
                return item

You can then load the CSV file with the csv module and feed it to the WeightedChoice class:

import csv

weighed_items = [(item,float(weight)) for item,weight in csv.reader(open('file.csv'))]
picker = WeightedChoice(weighed_items)
print(picker.pick())
Ants Aasma
Thank you very much this works exactly as I envisioned it.
MDA1973
Using `bisect` in `pick()` method will increase it's speed for long lists (O(log n) instead of O(n)).
Denis Otkidach
That depends on the distribution of input data. Sorting the weights largest first will reduce the amount of checks. If the weights are heavily skewed then a linear search might have a better average case than bisecting from the middle.
Ants Aasma
A: 
lst = [ ('Orange', 0.10), ('Apple', 0.05), ('Mango', 0.15), ('etc', 0.69) ]

x = 0.0
lst2 = []
for fruit, chance in lst:
    low = x
    high = x + chance
    tup = (low, high, fruit)
    lst2.append(tup)
    x += chance

if x > 1.0:
    raise ValueError, "chances add up to more than 100%"

low = x
high = 1.0
tup = (low, high, None)
lst2.append(tup)

import random

def pick_one(lst2):
    if lst2[0][2] is None:
        raise ValueError, "no valid values to choose"
    while True:
        r = random.random()
        for low, high, fruit in lst2:
            if low <= r < high:
                if fruit is None:
                    break  # try again with a different random value
                else:
                    return fruit

pick_one(lst2)


# test it 10,000 times
d = {}
for i in xrange(10000):
    x = pick_one(lst2)
    if x in d:
        d[x] += 1
    else:
        d[x] = 1

I think this is a little clearer. Instead of a tricky way of representing ranges as ascending values, we just keep ranges. Because we are testing ranges, we can simply walk forward through the lst2 values; no need to use reversed().

steveha
Thank you I think one thing your solution has highlighted to me which I don't think I had appreciated to now is the necessity to ensure that errors and things I don't expect to happen are covered. This is something I haven't really thought of yet and feel like it is something I should look into more as I learn Python.
MDA1973
+1  A: 

What you want is to draw from a multinomial distribution. Assuming you have two lists of items and probabilities, and the probabilities sum to 1 (if not, just add some default value to cover the extra):

def choose(items,chances):
    import random
    p = chances[0]
    x = random.random()
    i = 0
    while x > p :
     i = i + 1
     p = p + chances[i]
    return items[i]
job
+1. Clear, simple code.
Kylotan