I have a list of items. When I create the list, each item has an equal chance to be selected. But as an item is selected, its chance goes down while the others' chances go up. If a new item is added during the process, it should have the highest chance to be selected, with its chances falling off as it is selected. I am looking for a good algorithm that can accomplish this in C#.

Generalized idea: I have 5 items; over time, all 5 items will be selected 20% of the time. I am trying to keep the selections as close to that 20% as possible, cutting down on outliers. If an outlier exists, it will be selected more/less often to bring it back in line.

+1  A: 

You could do something such as make a custom class (for your list), that stores the Item plus a weighting.

Default the weighting to 1 when you add an item, and store (in the "list") the total of all of the weightings of all items.

You could then, when you select an item at random, just pick a number between 0 and the total weighting of all items in the list, and walk through the list to find the item at that "position" (by weighting). Decrement the weighting of that item (this can be some falloff, i.e.: multiply its weighting by 0.8, or 0.5 - you'd have a lot of control over how fast its probability of selection falls off), and return it.

The downside here, if you have lots of items, is that your selection is O(n) (since you have to walk through the list). Because of this, I'd probably use a linked list for the storage (you're going to have to walk for retrieval anyways, so this gives you faster insertion/removal).

If you're not storing a huge number of options, though, this would be very easy to implement, and give you a lot of control over probabilities plus probability reduction at selection time.
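A minimal sketch of this approach (the `WeightedPicker` name, the integer starting weight of 100, and the halving falloff are illustrative choices, not prescribed above):

```csharp
using System;
using System.Collections.Generic;

// Sketch of a weighted list: each entry carries an integer weight,
// selection walks the cumulative weights, and the selected entry's
// weight is reduced so it is less likely to come up again.
public class WeightedPicker<T>
{
    private readonly List<T> m_Items = new List<T>();
    private readonly List<int> m_Weights = new List<int>();
    private int m_TotalWeight;
    private readonly Random m_Rand = new Random();

    public void Add(T item, int weight = 100)
    {
        m_Items.Add(item);
        m_Weights.Add(weight);
        m_TotalWeight += weight;
    }

    public T Pick()
    {
        // pick a position in [0, totalWeight) and walk the list until the
        // cumulative weight range containing that position is found
        int target = m_Rand.Next(m_TotalWeight);
        int cumulative = 0;
        for (int i = 0; i < m_Items.Count; i++)
        {
            cumulative += m_Weights[i];
            if (target < cumulative)
            {
                // halve the weight on selection (keeping at least 1 so the
                // item can still be drawn), and fix up the cached total
                int reduced = Math.Max(1, m_Weights[i] / 2);
                m_TotalWeight -= m_Weights[i] - reduced;
                m_Weights[i] = reduced;
                return m_Items[i];
            }
        }
        throw new InvalidOperationException("picker is empty");
    }
}
```

Substituting a gentler falloff (e.g. multiplying by 0.8 instead of 0.5) controls how quickly a repeatedly selected item becomes unlikely.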

Reed Copsey
I would probably store weight as an integer, starting at 100 or something, and decrement, if only to make the rounded-off index cases more obvious, but this was pretty much my answer as well.
Mikeb
In addition to the O(n) retrieval time, this algorithm has two additional (potential) drawbacks: 1) it requires additional memory to store the weights; 2) weighted items may not have the same probability of being selected even if they have the same weight, since (as described) the algorithm finds the first item with a weight that fits the search.
LBushkin
Yeah, weight as integer makes sense. It's probably a better approach.
Reed Copsey
@LBushkin: I was actually thinking you'd select from 0-> total weights, and find the position at that index, not the first element that has a specific probability. You'd have equal chance of finding any element, by probability, that way. The extra storage is a real downside, though.
Reed Copsey
If it's a long running process, what do you do when the weights of all items get to 0? Storing one extra integer per choice doesn't seem like a major downside to me.
digitaljoel
@digitaljoel: That's one advantage of my original approach using floating-point values for weightings. You'll never get a 0 probability if you're scaling the weighting by a factor.
Reed Copsey
A: 

Use the elapsed time since the item was inserted or last selected as the priority modifier... Set each item's priority = the amount of time since it was inserted or last selected, and then adjust its chance to be selected by multiplying by this priority.

Once you have all the items' chances, normalize them (adjust them all by the same calculated ratio to get them all to add up to 1.000), then use those values as their probability of being selected.
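A rough sketch of this idea (the `TimeWeightedPicker` name and the clock handling are illustrative): priorities are recomputed from elapsed time on each pick, normalized, and used as a roulette wheel.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch: each item's priority is the time elapsed since it was
// inserted or last selected; selecting an item resets its clock.
public class TimeWeightedPicker<T>
{
    private readonly Dictionary<T, DateTime> m_LastSelected = new Dictionary<T, DateTime>();
    private readonly Random m_Rand = new Random();

    public void Add(T item)
    {
        m_LastSelected[item] = DateTime.UtcNow;
    }

    // "now" is passed in to make the behavior deterministic in tests
    public T Pick(DateTime now)
    {
        // priority = seconds since insertion/last selection (kept positive)
        var priorities = m_LastSelected.ToDictionary(
            kv => kv.Key,
            kv => Math.Max((now - kv.Value).TotalSeconds, 1e-6));

        // normalizing and rolling in [0, total) is equivalent to
        // normalizing each priority to sum to 1 and rolling in [0, 1)
        double total = priorities.Values.Sum();
        double roll = m_Rand.NextDouble() * total;
        foreach (var kv in priorities)
        {
            roll -= kv.Value;
            if (roll <= 0)
            {
                m_LastSelected[kv.Key] = now; // reset its clock
                return kv.Key;
            }
        }
        return priorities.Keys.Last(); // floating-point edge fallback
    }
}
```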

Charles Bretana
He didn't want priorities to fall off by time, but rather by how often they were selected. How does using elapsed time help that?
Reed Copsey
Where does he say that? He just says that the chance should fall off when it is selected, not by "how often" it is selected...
Charles Bretana
+4  A: 

Use a bucket weighted queue: Instead of using a list, divide your collection into buckets - each bucket has an associated frequency of retrieval. Items move from higher frequency buckets to lower frequency ones as they are selected.

An easy way to implement this is to assign each bucket a range of values, and generate a random number within the combined range to pick a bucket to choose from. You'll probably want to abstract this collection into some kind of class so that you don't expose consumers to the details.

Algorithm:

Initially, all items start in the same (top-level) bucket.

When you select an item, move it from the bucket it's in, to the next lower bucket. If necessary, create the next level bucket.

When a new item is added, add it to the top most (most frequently used) bucket.

To randomly select an item, first pick a bucket, then pick an item within the bucket. Move the selected item down into the next level bucket. Note, that moving an item down to a lower frequency bucket is optional - you can set some cutoff point.

When creating a new bucket, update the retrieval range associated with all buckets to give the new set the frequency distribution characteristic you want.

C# implementation (first cut) of a generic bucketed weighted queue:

using System;
using System.Collections.Generic;
using System.Linq;

namespace Utility
{
    public class BucketWeightedQueue<T>
    {
        private int m_MaxFallOff;
        private readonly double m_FallOffRate;
        private readonly List<List<T>> m_Buckets;
        private readonly List<int> m_FallOffFactors;
        private readonly Random m_Rand = new Random();

        public BucketWeightedQueue( IEnumerable<T> items, double fallOffRate )
        {
            // the rate is the factor by which a bucket's retrieval
            // frequency drops relative to the bucket above it
            if( fallOffRate < 0 || fallOffRate > 1 )
                throw new ArgumentOutOfRangeException("fallOffRate");

            m_MaxFallOff = 1;
            m_FallOffRate = fallOffRate;
            m_Buckets = new List<List<T>> { items.ToList() };
            m_FallOffFactors = new List<int> { m_MaxFallOff };
        }

        // new items enter the top-most (most frequently selected) bucket
        public void Add( T item )
        {
            m_Buckets[0].Add( item );
        }

        public T GetRandomItem()
        {
            var bucketIndex = GetRandomBucketIndex();
            var bucket = m_Buckets[bucketIndex];
            var itemIndex = m_Rand.Next( bucket.Count );
            var selectedItem = bucket[itemIndex];

            ReduceItemProbability( bucketIndex, itemIndex );
            return selectedItem;
        }

        private int GetRandomBucketIndex()
        {
            // each bucket's retrieval weight is inversely proportional to
            // its fall-off factor, so lower-frequency buckets are picked
            // proportionally less often; empty buckets are skipped
            var totalWeight = 0;
            for( int i = 0; i < m_Buckets.Count; i++ )
                if( m_Buckets[i].Count > 0 )
                    totalWeight += m_MaxFallOff / m_FallOffFactors[i];

            var roll = m_Rand.Next( totalWeight );
            for( int i = 0; i < m_Buckets.Count; i++ )
            {
                if( m_Buckets[i].Count == 0 )
                    continue;
                roll -= m_MaxFallOff / m_FallOffFactors[i];
                if( roll < 0 )
                    return i;
            }
            return 0; // unreachable while any bucket is non-empty
        }

        private void ReduceItemProbability( int bucketIndex, int itemIndex )
        {
            if( m_FallOffRate <= 0 )
                return; // do nothing if there is no fall off rate...

            var item = m_Buckets[bucketIndex][itemIndex];
            m_Buckets[bucketIndex].RemoveAt( itemIndex );

            if( bucketIndex == m_Buckets.Count - 1 )
            {
                // the item came from the last bucket, so create a new,
                // lower-frequency bucket below it
                m_Buckets.Add( new List<T>() );
                m_MaxFallOff = (int)(m_FallOffFactors[bucketIndex] / m_FallOffRate);
                m_FallOffFactors.Add( m_MaxFallOff );
            }

            m_Buckets[bucketIndex + 1].Add( item );
        }
    }
}
LBushkin
I think this is the closest to the questioner's requirement; why no upvotes?
Bill Yang
I like the idea of the data structure encapsulating the rules. Nice work.
Paul Suart
+1  A: 

It all depends how the probability of being selected should vary when a given item gets selected.

A simple way to achieve this is with double drawring, whereby you start with the input list (which never changes) and an empty list to which you add the items selected at random. After each normal draw, (from first list), you draw one item out of the second list. If the same item (or rather item value) comes up again, it is not selected (i.e. invalid draw, start anew), otherwise the draw counts and the corresponding value is added to second list.
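A sketch of the double-drawing procedure (names are illustrative; note the degenerate case of a one-item list whose item is already in the history, which would re-draw forever):

```csharp
using System;
using System.Collections.Generic;

// Sketch of double drawing: draw a candidate from the fixed input list,
// then draw from the history list; if the history draw matches the
// candidate, the draw is invalid and we start over.
public static class DoubleDraw
{
    public static T Draw<T>(IList<T> input, IList<T> history, Random rand)
    {
        while (true)
        {
            T candidate = input[rand.Next(input.Count)];
            if (history.Count > 0)
            {
                T check = history[rand.Next(history.Count)];
                if (EqualityComparer<T>.Default.Equals(check, candidate))
                    continue; // invalid draw: start anew
            }
            history.Add(candidate); // a valid draw joins the history
            return candidate;
        }
    }
}
```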

The general concept seems related to ergodic sources.

DigitalJoe commented on an apparent drawback to this approach. In a nutshell, Joe noted that the probability of preventing the repeating (not necessarily consecutive repeats, just "repeating") of a previously drawn item value (a value found in the second list of the algorithm) fluctuates greatly in the first few hundred drawings. Another implicit commentary was that after the second list contains a few dozen values, the probability of preventing such duplication is extremely low. These are both valid points, and require qualification.

This reasoning matches our intuitive understanding of the way the second list works: the more item values are in there, the fewer chances we have to "double draw", hence to prevent a repeat. This is true, but it only focuses on the second list. The probability of drawing [from the first list] an item previously seen needs to be taken into account. We can use two examples to intuit this idea, but this is of course a gradient:

  • Case "A": the probability of drawing a given item value is relatively low compared with the probability of drawing something else. The combined probability of drawing this item multiple times during the first few drawings is therefore low, and the probability of doing so and then of discarding it due to the drawing(s) in list two is even lower (even though the latter probability may be high, due to the small number of items in the second list).
  • Case "B": the probability of drawing a given item is high relative to the probability of drawing other items. In that case, the probability of having repeats in the first few draws is high, and since the probability of preventing an actual repeat (with the drawing in the second list) is high as well (a certainty on the 2nd drawing, a 50% chance on the 3rd drawing... a 10% chance on the 11th drawing), the overall probability of "punishing" a highly probable item is high. This is not a problem, however, because the relative probability of drawing the item from the first list will ensure that there are, statistically, many other opportunities to produce duplicates that will not be so "forcefully" suppressed as the number of drawings progresses.

In both cases, what matters is that the overall distribution of the actual drawings matches that of the distribution in the input list. This is particularly true as the number of drawings becomes more statistically significant.

On the question of the possibility of having too weak of a repeat filter, that too gets to be less relevant as the second list reflects more and more the distribution of the first list. Maybe a way to get a feel for all this is to consider alternative ways of achieving the goals described in the OP's question.

Alternative algorithms:
Algo 1: Drawing without replacement from the first list. That is, we'd use a copy of the original list to start, and each time a given item is drawn, we'd remove it from that list, hence making it overall less probable that the same value would show up again. By the time we'd drawn all of the items, we'd have reproduced exactly the distribution of the original list.
Algo 2: Drawing without replacement, from a list where the original list has been replicated a given number of times. This is similar to the above, but introduces a bit more randomness, i.e. by requiring a bigger number of drawings for the drawing distribution to approach and match that of the original list.
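Algo 2 can be sketched like this (illustrative names; the swap-with-last removal is just an O(1) way to draw without replacement):

```csharp
using System;
using System.Collections.Generic;

// Sketch of Algo 2: replicate the original list a fixed number of
// times, then draw without replacement from the resulting pool.
public static class ReplicatedDraw
{
    public static List<T> Build<T>(IList<T> original, int copies)
    {
        var pool = new List<T>();
        for (int i = 0; i < copies; i++)
            pool.AddRange(original);
        return pool;
    }

    public static T Draw<T>(List<T> pool, Random rand)
    {
        int i = rand.Next(pool.Count);
        T item = pool[i];
        // remove without preserving order: swap with the last, drop it
        pool[i] = pool[pool.Count - 1];
        pool.RemoveAt(pool.Count - 1);
        return item;
    }
}
```

Once the pool is exhausted, the draws have reproduced the original distribution exactly, `copies` times over.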

In a way, the two-list algorithm I suggested originally (draw with replacement from the original list and manage a second list to sometimes prevent repeats) is similar to Algo 2, in that it makes the drawing distribution converge towards that of the original list. The advantage of the original algorithm, however, is that it makes the management of the lists easier (although in fairness, a simple way to do so is to replace drawn items with a "null" value of sorts, and to re-draw upon hitting such an "empty cell", which is effectively the same thing, in reverse, as re-drawing when the drawing in the second list produces the same item value).

mjv
I like this approach. It takes care of the decreasing chance of selection as an item is selected, and assuming you are putting duplicates in the previously-drawn list, it further reduces the chance of selection every time an item is selected. One problem I see is that the penalty for being chosen is greatly reduced as more selections are made. On the second draw, the chance of repeating the previous item is 0. On the 11th draw, the chance of repeating the previous item is only 10% less likely than getting a previously unselected item; 1% at the 101st, etc.
digitaljoel
@digitaljoel I don't quite follow the figures you quote for probabilities. Nevertheless, one can offset this problem by introducing additional attempts at disqualifying a given draw (i.e. drawing say 3 or 4 times from the "previously drawn" list). There comes a time (and this varies with the number of distinct items, and with their probability of being drawn in the first place) when the previously-drawn list constitutes a workable tool to ensure the variability associated with individual draws, while respecting the desired distribution associated with the data.
mjv
I am not a statistician, but here was my thinking. On the first draw, you put the item in list 2. On the second draw, if you draw the first item again, it is the only option in list 2 so it is automatically drawn from list 2 and invalidates the pick from list 1. On the 11th draw, list 2 now has 10 items in it, therefore, only a 10% chance of choosing the item from list 2 that was chosen in the previous trial from list 1, so the chance of invalidating the draw, even though it is the same item that was drawn previously, is only 10%, not the 100% that we had in draw 2.
digitaljoel
@digitaljoel. Your reasoning is correct, in part; see my [long] edit in the text, which shows how the big variations in the probability of filtering out item values in the early stages of the drawing do not affect the ability of the algorithm to produce drawings which are "true" to the distribution found in the original list.
mjv
A: 

the general strategy for picking a weighted random element from a list is this: give each item a weight. normalise, so that the total weight is 1. (so to start with, every item has weight 1/n). sort your list by weight. now pick a random number between 0 and 1, and go down the list accumulating totals till you cross your number. e.g. if your weights are [0.4, 0.3, 0.2, 0.1] and your random number is 0.63215, your first two steps have total = 0.4, total = 0.7 and then notice that 0.7 is greater than 0.63215 so you return the second element.

once you pick an element, you adjust its weighting downwards (you'll need to experiment with downweighting formulas till you find one that works for you, the simplest is just to multiply it by some constant fraction each time), and then renormalise and repeat.
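a rough sketch of the above (illustrative names; the random roll is passed in as a parameter to keep the function deterministic, and sorting is omitted since the cumulative walk is correct in any order):

```csharp
using System;
using System.Linq;

// Sketch: normalized weights, cumulative walk to find the pick,
// then downweight the picked element and renormalize to sum to 1.
public static class NormalizedPick
{
    public static int Pick(double[] weights, double roll, double falloff)
    {
        // walk the list accumulating weight until we cross the roll
        double total = 0;
        int picked = weights.Length - 1; // guards floating-point edge
        for (int i = 0; i < weights.Length; i++)
        {
            total += weights[i];
            if (roll < total) { picked = i; break; }
        }

        // downweight the pick by a constant fraction, then renormalize
        weights[picked] *= falloff;
        double sum = weights.Sum();
        for (int i = 0; i < weights.Length; i++)
            weights[i] /= sum;

        return picked;
    }
}
```

with weights [0.4, 0.3, 0.2, 0.1] and a roll of 0.63215, the cumulative totals are 0.4 then 0.7, and 0.7 > 0.63215, so the second element (index 1) is returned, matching the worked example above.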

note that this is fairly inefficient if you have a lot of elements since it is O(n) in the length of the list, but it shouldn't matter in practice unless you're doing this in an inner loop that needs to be heavily optimised, or similar. if it turns out to be an issue you could look into geometric data structures like range trees to optimise the lookup.

Martin DeMello
+4  A: 

Here we will engineer a random number generator whose distribution favors low values. You can use it to prefer items at the beginning of a list. To decrease the odds of something being selected, move that item down the list. You have a few options for how you want to move the item down the list. Let's review the random variable transformation first.

By applying the following function to a uniform random variable between 0 and 1:

index = Int(l*(1 - r^0.5))  # l = length, r = uniform random var between 0 and 1

You get a cool distribution that drastically reduces the odds of a larger index (shown here for a list of length 20):

p(0)=0.09751
p(1)=0.09246
p(2)=0.08769
p(3)=0.08211
p(4)=0.07636
p(5)=0.07325
p(6)=0.06772
p(7)=0.06309
p(8)=0.05813
p(9)=0.05274
p(10)=0.04808
p(11)=0.04205
p(12)=0.03691
p(13)=0.03268
p(14)=0.02708
p(15)=0.02292
p(16)=0.01727
p(17)=0.01211
p(18)=0.00736
p(19)=0.00249

Here is the distribution for a list of size 2

0.75139
0.24862

Size 3

0.55699
0.33306
0.10996

Size 4

0.43916
0.31018
0.18836
0.06231
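The transformation itself is tiny; a sketch (with an added clamp, since r exactly 0 would otherwise yield index == length, one past the end of the list):

```csharp
using System;

// Sketch: map a uniform r in [0, 1) to an index that favors the
// front of the list, via index = floor(l * (1 - sqrt(r))).
public static class FavorLow
{
    public static int Index(int length, double r)
    {
        // 1 - sqrt(r) concentrates probability mass near index 0;
        // the clamp guards the r == 0 edge case
        int i = (int)(length * (1 - Math.Sqrt(r)));
        return Math.Min(i, length - 1);
    }
}
```

For example, with a list of size 2, any r above 0.25 maps to index 0, which reproduces the ~0.75 / ~0.25 split listed above.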

Now let's discuss the two options for moving the items down the list. I tested two:

  • ToEnd - Move most recently selected item to end of list

  • Sort - Keep an associated array of the number of times each item has been selected and sort the list from least to most selected.

I created a simulation to pick from a list and examine the standard deviation of the count that each item was selected. The lower the standard deviation, the better. For example, 1 simulation for a list of 10 items where 50 selections were made created the spread:

{"a"=>5, "b"=>5, "c"=>6, "d"=>5, "e"=>4, "f"=>4, "g"=>5, "h"=>5, "i"=>6, "j"=>5}

The standard deviation for this simulation was

0.63

With the ability to run a simulation, I then computed some meta-statistics by running the simulation 500 times and providing the average standard deviation for each method: ToEnd and Sort. I expected the standard deviation to be high for a low number of picks, but in fact for the ToEnd algorithm the standard deviation increased with the number of picks. The Sort method fixed this.

Testing ["a", "b", "c", "d", "e"]
-------------------------
Picks  ToEnd (StdDev)  Sort (StdDev)
5      0.59            0.57
10     0.76            0.68
15     0.93            0.74
20     1.04            0.74
25     1.20            0.73
30     1.28            0.73
35     1.34            0.74
40     1.50            0.75
45     1.53            0.75
45     1.56            0.77
80     2.04            0.79
125    2.52            0.74
180    3.11            0.77
245    3.53            0.79
320    4.05            0.75
405    4.52            0.76
500    5.00            0.78

Here are some test results for a larger set.

Testing ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
-------------------------
Picks  ToEnd (StdDev)  Sort (StdDev)
10     0.68            0.65
20     0.87            0.77
30     1.04            0.80
40     1.18            0.82
50     1.30            0.85
60     1.43            0.84
70     1.57            0.87
80     1.65            0.88
90     1.73            0.87
90     1.71            0.87
160    2.30            0.89
250    2.83            0.88
360    3.44            0.88
490    3.89            0.88
640    4.54            0.87
810    5.12            0.88
1000   5.66            0.85

With a good test framework, I couldn't resist trying a different random number transformation. My assumption was that if I took the cube root instead of the square root of r, the standard deviation would decrease. It did in fact, but I was worried that this would decrease the randomness. Here you can observe a few simulations when the formula is changed to

index = Int(l*(1 - r^0.33))  # l = length, r = uniform random var between 0 and 1

Now inspect the actual picks. As I thought, it's very weighted toward the beginning of the list at first. If you want to weight this heavily, you should probably randomize your list before you start.

StdDev = 0.632455532033676
{"a"=>10, "b"=>10, "c"=>11, "d"=>9, "e"=>10}
a d e c b c e b a d b e c a d d e b a e e c c b d a d c b c e b a a d d b e a e a b c b d c a c e c

StdDev = 0.0
{"a"=>10, "b"=>10, "c"=>10, "d"=>10, "e"=>10}
b c d a a d b c b a d e c d e c b e b a e e d c c a b a d a e e b d b a e c c e b a c c d d d a b e

StdDev = 0.632455532033676
{"a"=>9, "b"=>10, "c"=>10, "d"=>10, "e"=>11}
b d a e b c a d c e e b a c a d d c b c e d a e b b a c d c d a e a e e b d c b e a b c b c d d e e

StdDev = 0.0
{"a"=>10, "b"=>10, "c"=>10, "d"=>10, "e"=>10}
a b e d c e a b d c b c c a d b e e b e a d d c e a d b b c c a a a b e d d e c c a b b e a c d e d

StdDev = 0.632455532033676
{"a"=>11, "b"=>10, "c"=>9, "d"=>10, "e"=>10}
b a d c d c a e b e a e b c d b c a a d e e d c d e c b a b b e d c d b c e a a a d b c e b e a d a
jrhicks