tags:

views:

295

answers:

4

Hi everyone,

I would like to learn how to select weighted items. For example : I want to fetch questions from a pool but if some one can't give right answer to a question, that causes this question to double its weight and increase the probability of being selected again later on.

+2  A: 

Keep around an array of the candidate items. If one item has weight 2, put it in the array twice, generally if one has weight n put it in there n times. Then select a random element from the array. Ta-daaa.

varzan
Simple, but not very memory-efficient. May or may not be a good answer depending on how many questions maximum we're talking about.
Eric J.
This is rather inefficient - especially for larger weights. It's better to calculate sum of all weights, pick a random number in that interval and pick the last item where sum of all weights up to and including that item does not exceed selected random number.
ChssPly76
@ChssPly76: That's not necessarily "better", it really depends on the numbers we're talking about. It's really a trade-off between memory (varzan's solution) and CPU cycles (your solution... since you have to iterate through half the list of items on average to pick the right one). That may be nit-picky, but I think this is a good, easy problem for people to think about this classic trade-off between memory-efficiency and CPU-efficiency.
Eric J.
+1: Maybe not the best way to go due to poor efficiency, and imho, varzan could have pointed this out. None-the-less, sometimes quick and easy is the best way to go, and this could be so easy to code, every time a question is answered wrong add it to the list of questions. And why don't the complainers post a better solution?
tom10
But the data structure that @varzan suggestions is difficult to maintain. How do you modify the probabily of an existing question? That operation could require a full scan of the list. If question weights change frequently, then this approach may be neither simple nor efficient.
Todd Owen
Well, I must agree with all the comments about inefficiency. The technique that Andre Hoffman proposed would probably be better, but I'd still go with this one for a small number of questions (and by default I assume tiny usecases, my bad).
varzan
Well it's pretty nice and understandable but I guess it's inefficient.
Braveyard
+2  A: 

Have a look at this this(scroll down for the code).

EDIT for the critics:)

The code on this thread I linked shows how to implement a binary tree approach that actually works with weights and doesn't store masses of elements in an array to achieve a weighted probability. Then again, it's pretty inefficient when the weights change very often as the binary tree has to be re-created every time a weight changes.

EDIT2:

See Todd Owen's post about using self-balancing trees. The tree obviously does not have to be re-created every time a weight changes. That part just isn't included in the implementation I linked and needs to be added if your weights change a lot.

André Hoffmann
-1, link with no description, summary, or any relevant information about what's behind the link
Argalatyr
downvote withdrawn!
Argalatyr
Nice, that's a solution that's both space and time efficient.
Eric J.
+2  A: 

I like @André Hoffmann's idea of using a binary tree, in which every leaf node corresponds to a question, and every intermediate node stores the sum of the weight of its child nodes. But he says the tree needs to be re-created every time a weight changes.

Actually, this need not be the case! When you change the weight of a given leaf, you only need to update the weights of those nodes between it and the root of the tree. But...you also need some way to find the node within the tree, if you want to modify it.

So my suggestion is to use a self-balancing binary tree (e.g. a red-black tree, AVL tree, etc), which is ordered by the question ID. Operations on the tree need to maintain the property that the weight of any node is equal to the sum of the weights of its children.

With this data structure, the root node's weight W is equal to the sum of the weights of all the questions. You can retrieve a question either by question ID, or by a random weight (between zero and W). This operation, as well as insertions, deletions, or updating the weight of a question are all O(log n).

Todd Owen
I went to bed knowing that it is possible to use the algorithm without re-creating the tree. Thanks for pointing that out. Good call
André Hoffmann
+3  A: 

Have a class which keeps the item:weight pairs (key=item:value=weight) in a hash table.

The class should also maintain a total_weight variable, which is the sum of all the weights in the hash table. The class' methods to add_item, remove_item, and update_weight for an item should keep the total_weight updated. This avoids having to recalculate the total for every choice.

To choose an item: Use a random number such that 1<=random_number<=total_weight. Iterate over the item:weight pairs in the hash table, summing the weights until the random number is <= that running sum. When that happens, the key of the pair you're on is the chosen item.

This is like rolling an imaginary die whose size is the sum of all the weights. For every roll, each item has its own range of numbers on the die, with the size of each range equal to its item's weight. If the roll result falls within an item's range, that item is the chosen one.

Editing to add the following sample code after the request in the comment below. Tested this with Python 2.5.2:

from random import randint  # Import randint function from random module.

class WeightedCollection(object):
    def __init__(self):
        self.total_weight = 0
        self.items = {}  # This is a python dictionary == a hash table
    def add_item(self, item, weight):
        self.items[item] = weight
        self.total_weight += weight
    def remove_item(self, item):
        self.total_weight -= self.items[item]  # Subtracts the weight.
        del(self.items[item])
    def update_weight(self, item, new_weight):
        self.total_weight += (new_weight - self.items[item])
        self.items[item] = new_weight
    def get_random_item(self):
        ''' Returns random selection but weighted by item weights. '''
        # Result of call below is 1 <= random_number <= self.total_weight...
        random_number = randint(1, self.total_weight)
        sum_so_far = 0
        # For every item and its weight...
        for item, weight in self.items.iteritems():
            sum_so_far += weight
            if random_number <= sum_so_far:
                return item

# Usage demo...

questions = WeightedCollection()

questions.add_item('What is your name?', 1)
questions.add_item('What is your favorite color?', 50)
questions.add_item('What is the meaning to life?', 100)

print 'Here is what the dictionary looks like:'
print questions.items
print ''
print "Total weight:", questions.total_weight
print ''
print 'Some sample random picks...'
for i in range(5):
    print questions.get_random_item()

And here is the output:

Here is what the dictionary looks like:
{'What is the meaning to life?': 100, 'What is your name?': 1, 'What is your favorite color?': 50}

Total weight: 151

Some sample random picks...
What is your favorite color?
What is the meaning to life?
What is the meaning to life?
What is your favorite color?
What is the meaning to life?
Anon
Could you give me sample code written in C#, the best way for me to understand it to go thru on a sample code. Thanks.
Braveyard
I'm not a C# programmer, but I've added sample code in Python which hopefully will help, since Python is pretty easy to follow. Python has built-in dict's for hash tables, but in C#, you will probably have to look for something like a Hashtable class in a Collections library or some such. Others can perhaps speak to that, and to where to find the random number functions for C#.
Anon
Thank you very much.
Braveyard
You're welcome. To implement your larger program, rather than just using strings for the questions, you might well want to have a Question class (to include the correct answers in the question instances, for one thing), but I wanted to stay simple and focus on the weighted random choice issue in the sample code above.
Anon
Even though, I haven't ever used Python in my life but your code is pretty easy to understand.
Braveyard