views:

500

answers:

5

Suppose I have a list, called elements, each of which does or does not satisfy some boolean property p. I want to choose one of the elements that satisfies p by random with uniform distribution. I do not know ahead of time how many items satisfy this property p.

Will the following code do this?:

pickRandElement(elements, p)
     randElement = null
     count = 0
     foreach element in elements
          if (p(element))
               count = count + 1
               if (randInt(count) == 0)
                    randElement = element

     return randElement

(randInt(n) returns a random int k with 0 <= k < n.)

+6  A: 

Yes, I believe so.

The first time you encounter a matching element, you definitely pick it. The next time, you pick the new value with a probability of 1/2, so each of the two elements have an equal chance. The following time, you pick the new value with a probability of 1/3, leaving each of the other elements with a probability of 1/2 * 2/3 = 1/3 as well.

I'm trying to find a Wikipedia article about this strategy, but failing so far...

Note that more generally, you're just picking a random sample out of a sequence of unknown length. Your sequence happens to be generated by taking an initial sequence and filtering it, but the algorithm doesn't require that part at all.

I thought I'd got a LINQ operator in MoreLINQ to do this, but I can't find it in the repository... EDIT: Fortunately, it still exists from this answer:

public static T RandomElement<T>(this IEnumerable<T> source,
                                 Random rng)
{
    T current = default(T);
    int count = 0;
    foreach (T element in source)
    {
        count++;
        if (rng.Next(count) == 0)
        {
            current = element;
        }            
    }
    if (count == 0)
    {
        throw new InvalidOperationException("Sequence was empty");
    }
    return current;
}
Jon Skeet
Jon, as far as I see it, this algorithm will ALWAYS pick the first element that meets p, what I am missing?
tekBlues
@tekBlues: it keeps going after it's picked the first one.
AakashM
I'm pretty sure this algorithm works, if the random generator does its job properly.
ammoQ
aakashm, he says he only wants to pick one element
tekBlues
Instead of "picks", think of the algorithm as "holding" elements. It holds the first one, then might drop it to hold the second one, etc. At the end, you're left holding some element, which is the one that you return.
Nate Kohl
@tekBlues: It picks the first one, then potentially replaces its pick with the next matching one, then potentially replaces it again etc. Note that it *doesn't* return the moment it's picked one - it returns when it's gone through the whole sequence.
Jon Skeet
A: 

For clarity's sake, I would try:

pickRandElement(elements, p)
     OrderedCollection coll = new OrderedCollection
     foreach element in elements
          if (p(element))
               coll.add(element)
     if (coll.size == 0) return null
     else return coll.get(randInt(coll.size))

To me, that makes it MUCH clearer what you're trying to do and is self-documenting. On top of that, it's simpler and more elegant, and it's now obvious that each will be picked with an even distribution.

JoeCool
That's the code we have now (and I admit it is more clear). I'm hoping for something more efficient. Creating a list and adding elements to it, I'm guessing, is somewhat inefficient and wasteful, if the proposed alternative will work.
Paul Reiners
Once you see what the proposed algorithm is doing, it's extremely elegant IMO.
Jon Skeet
Yes, you're right, if efficiency is the top priority. I would say that the more obvious solution that I provide is more readable though.
JoeCool
+3  A: 

In The Practice of Programming, pg. 70, (The Markov Chain Algorithm) there is a similar algorithm for that:

[...]
  nmatch = 0;
  for ( /* iterate list */ )
    if (rand() % ++nmatch == 0) /* prob = 1/nmatch */
      w = suf->word;
[...]

"Notice the algorithm for selecting one item at random when we don't know how many items there are. The variable nmatch counts the number of matches as the list is scanned. The expression

rand() % ++nmatch == 0

increments nmatch and is then true with probability 1/nmatch."

Nick D
+10  A: 

It works mathematically. Can be proven by induction.

Clearly works for n = 1 element satisfying p.

For n+1 elements, we will choose the element n+1 with probability 1/(n+1), so its probability is OK. But how does that effect the end probability of choosing one of the prior n elements?

Each of the prior n had a chance of being selected with probability 1/n, until we found element n+1. Now, after finding n+1, there is a 1/(n+1) chance that element n+1 is chosen, so there is a n/(n+1) chance that the previously chosen element remains chosen. Which means its final probability of being the chosen after n+1 finds is 1/n * (n/n+1) = 1/n+1 -- which is the probability we want for all n+1 elements for uniform distribution.

If it works for n = 1, and it works for n+1 given n, then it works for all n.

John Pirie
Induction has saved my butt too many times in computation proofs!
JoeCool
There's actually a simpler way to prove this.For n elements, we will choose the element n with probability 1/n. But what about the prior n-1 elements?Well, under induction we know that all these n-1 elements have the same probability. So the probability for each _must_ be 1/n, since 1/n is the only number that's 1 when multiplied with n. qed :)
FeepingCreature
+1  A: 

decowboy has a nice proof that this works on TopCoder

Paul Reiners