I have a case where I need to select a random item, but I don't know the total number of items and I don't want to build a huge array then pick an item out. For example, this is what I have right now:

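List<string> items = new List<string>();
while (true)
{
    string item = GetNextItem();
    if (item == null)
        break;
    items.Add(item);  // store every item just to count them
}
int index = random.Next(0, items.Count);  // upper bound is exclusive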

As you can see, I'm building a gigantic collection that I really don't need; I just need a random number between 0 and the number of items. Here is what I am thinking of doing instead. It works, but I'd like to know if any of the experts out there can find a fault with it:

int index = -1;
int total;
string selectedItem;
while (true)
{
    string item = GetNextItem();
    if (item == null)
        break;

    ++total;
    int rnd = random.Next(0, total);
    if (rnd == total - 1)
    {
        index = total - 1;
        selectedItem = item;
    }
}

This gives me my index number and the randomly selected item. My thinking is that when there are 3 total items, for example, I pick a random number between 0 and 2 (inclusive). If it equals 2, the new item becomes the selected item; otherwise I ignore it. As the total number of items increases, each new item's chance of being selected decreases accordingly.

Is this method "good"? Is it as "random" as building the array and picking an item out later? Is it as fast as it can be? Please guide me through my ignorance in random numbers. :)

+9  A: 

What you're doing will work.

Here's a restating of it that might make the algorithm slightly more clear:

  1. Select the first item; there is a 100% chance it will be the current selection
  2. If there is a second item, there is a 1/2 chance it will replace the current selection (if you do the math, it's a 50% chance the selection will be the first item, and a 50% chance it will be the second item)
  3. If there is a third item, there is a 1/3 chance it will replace the current selection (again, the math gives a probability of 1/3 for each item)
  4. If there is a fourth item, there is a 1/4 chance it will replace the current selection
  5. ... etc ...

Note that you should be able to compute a 1/x chance by saying rand.Next(0,x) == 0 (or any other integer between 0 and x - 1 inclusive); you don't have to bother using total - 1.
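
As a rough illustration of the steps above, here is a minimal sketch in Python (not the question's C#). The function name pick_random and the convention of returning (-1, None) for an empty input are assumptions made for this example; randrange(total) plays the role of rand.Next(0, total), since both draw uniformly from 0 to total - 1.

```python
import random

def pick_random(items, rng=random):
    """Return (index, item) for one uniformly chosen element of an iterable.

    The length of the iterable does not need to be known in advance.
    Returns (-1, None) if the iterable is empty.
    """
    index = -1
    selected = None
    total = 0
    for item in items:
        total += 1
        # randrange(total) is uniform over 0..total-1, so this is a 1/total chance.
        if rng.randrange(total) == 0:
            index = total - 1
            selected = item
    return index, selected

# Usage: works on any iterator, including ones that cannot be rewound.
print(pick_random(iter(["a", "b", "c"])))
```

Only the current selection and a counter are kept, so memory use stays constant no matter how many items arrive.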

It's actually a pretty neat approach; initially I thought there wasn't going to be any good way of doing what you were asking!

Daniel LeCheminant
I am pretty bad at probability and statistics but this looks like the best option without knowing the upper bound ahead of time.
Josh Einstein
Bingo, that is what I was thinking in my rambling description. Can anyone out there find a fault with this, as far as picking random items goes?
Jon Tackabury
@Jon, no fault to be found with this approach: it's a very classic, even traditional algorithm, published e.g. in Knuth's "Art of Computer Programming" books -- see for example http://geomblog.blogspot.com/2008/01/happy-birthday-don-knuth.html .
Alex Martelli
@Alex: Thanks for the excellent link!
Jon Tackabury
This is indeed a well known algorithm. The generalised form, for picking several elements, is called Reservoir Sampling (http://en.wikipedia.org/wiki/Reservoir_sampling).
Nick Johnson
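
For reference, the generalised form mentioned in the comment above can be sketched roughly as follows. This is Python rather than the question's C#, the name reservoir_sample is made up for the example, and it follows the simple textbook variant of reservoir sampling; treat it as an illustration, not a tuned implementation.

```python
import random

def reservoir_sample(items, k, rng=random):
    """Return k elements chosen uniformly from an iterable of unknown length.

    If the iterable has fewer than k elements, all of them are returned.
    """
    reservoir = []
    for seen, item in enumerate(items, start=1):
        if seen <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k/seen by drawing a slot
            # from 0..seen-1 and replacing only if it falls inside the reservoir.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(100), 5))
```

With k = 1 this reduces to the single-item selection discussed in the question.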
A: 

In your first code snippet, you use items.count, so you know how many elements you have. You need to know this number so that each element has an equal chance of being selected.

As you wrote, you generate a random number i such that 0 <= i < items.count, and then you try to quickly access element i of the list. (A linked list might not be a good choice of data structure.)

If you have a good estimate N of the number of items, you can use this instead of items.count.

In your second code snippet, you need to initialize "total" to zero before incrementing it.

Winston C. Yang
+2  A: 

Your approach looks good, yes.

1 item = gets selected

2 items = 50% chance you pick the 2nd item to replace the 1st

3 items = 33% chance you pick the 3rd item, 67% chance you keep one of the first two items

4 items = 25% chance you pick 4th item, 75% chance you pick ...

...

So, contrary to most of the other responses here, I think you have a working solution that gives an even probability distribution.

You could simplify the random check:

    int rnd = random.Next(0, total);
    if (rnd == 0)

It doesn't matter which of the values from 0 to total - 1 you test for; each one comes up with the same 1/n probability.
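
If you want to convince yourself empirically, a quick simulation along these lines should do. It is a sketch in Python rather than C#; pick_with_target and frequencies are hypothetical helper names, and the trial count and seed are arbitrary choices.

```python
import random
from collections import Counter

def pick_with_target(items, target_of, rng):
    """One-pass pick that replaces the selection when the draw equals target_of(total)."""
    selected = None
    for total, item in enumerate(items, start=1):
        if rng.randrange(total) == target_of(total):
            selected = item
    return selected

def frequencies(target_of, trials=30000, seed=0):
    """Count how often each of the items 'a', 'b', 'c' ends up selected."""
    rng = random.Random(seed)
    return Counter(pick_with_target("abc", target_of, rng) for _ in range(trials))

first = frequencies(lambda total: 0)          # compare against 0
last = frequencies(lambda total: total - 1)   # compare against total - 1, as in the question
print(first)
print(last)
```

Both counters should come out close to 10000 per item, which is what an even distribution over three items predicts.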

Hightechrider
A: 

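We can prove it by induction on the number of items n.
Claim: after n items have been seen, each of them is the current selection with probability 1/n.
It is true for n = 1.
If it is true for n, it is true for n+1:
=> prob. of selection for each of the first n elements = 1/n
=> since the prob. of selecting the (n+1)th element is 1/(n+1),
=> the prob. of not selecting the (n+1)th element is n/(n+1)
=> prob. of selection for each of the first n elements after adding the (n+1)th element = (1/n)*(n/(n+1)) = 1/(n+1)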
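
The same argument can be checked with exact arithmetic. The following is a small Python sketch (the function name selection_probabilities is made up for this example) that tracks each item's probability of being the current selection as items arrive, using fractions so there is no rounding.

```python
from fractions import Fraction

def selection_probabilities(n):
    """Exact probability that each of n items ends up selected by the one-pass method."""
    probs = []
    for total in range(1, n + 1):
        replace = Fraction(1, total)  # chance the newly arrived item takes over
        # Each earlier item stays selected only if the new item does not replace it.
        probs = [p * (1 - replace) for p in probs]
        probs.append(replace)
    return probs

print(selection_probabilities(4))
# [Fraction(1, 4), Fraction(1, 4), Fraction(1, 4), Fraction(1, 4)]
```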