views:

311

answers:

2

Given an array of true/false values, what is the most efficient algorithm to select an index with a true value at random.

A sketch simple algorithm is

a <- the array
c <- 0
for i in a:
    if a[i] is true: c++
e <- random number in (0, c-1)
j <- 0
for i in e:
    while j is false: j++
return j

Can anyone come up with a faster algorithm? Maybe there is a way to only walk through the list once even if the number of true elements is not known at first?

+6  A: 

Build a list with indexes that point to true values and select one of those at random. Requires O(n) for list traversal and one try for the random number.

Joey
That certainly is faster than what I came up with, though it uses O(n) working space, where mine uses only constant working space. So there still might be room for improvement.
momeara
Is it certainly faster? If true values are very rare, then it's almost certainly faster. If false values are very rare, then it's almost certainly slower. Where the break-even point is, I don't know.
Steve Jessop
Yes, certainly the distribution of true/false values does matter for the question which algorithm is more efficient. But when that's not known all bets are off, as usual. Still, I find Jon's answer very nice and likely to be better than this.
Joey
@Steve: It really depends on how often you need to do it. Building the index only happens once, so even if false values are extremely rare, using the index will eventually be faster.
Jerry Coffin
Yes, I hadn't thought of that: there may be guarantees that a certain number of selections will be done before the next time the contents of the array are changed. If we're allowed to widen the scope a little, then it might be better to get rid of the array of flags entirely, and just keep a structure containing the "true" indices. Insertion/removal will be slower, but selection will be faster.
Steve Jessop
+5  A: 

Use the "pick a random element from an infinite list" algorithm.

Keep an index of your current pick, and also a count of how many true values you've seen.

When you see a true value, increment the count and then replace your pick with the current index with a probability of P=(1/count). (So you always pick the first one you find... then you might switch to the second one, with probability 1/2, then you might switch to the third one with probabilty 1/3 etc.)

This requires only one scan over the list and constant storage. (It does require you to work out a larger number of random numbers, however.) In particular, it doesn't ever require you to either buffer the list or go back to the start - so it can work on an unbounded input stream.

See this answer for a sample LINQ implementation of the simple "pick a random element" algorithm; it would just need minor tweaks.

Jon Skeet
A few more details and a proof here: http://stackoverflow.com/questions/1133942/what-is-the-most-efficient-way-to-pick-a-random-card-from-a-deck-when-some-cards/1134286#1134286. This question is functionally a duplicate of that one, although phrased a bit differently. My instinct is that it will most likely be slower than the two-pass algorithm, assuming data in memory. But worth testing, if the two-pass performance is unacceptable for whatever reason.
Steve Jessop
@Steve: It depends on the sparseness of "true" values vs cost of generating a random number. If you have a million entries in the list, only 2 of which are "true", then this is likely to be a win. If, on the other hand, you have a million entries *all* of which are true, the two-pass algorithm will probably be faster. In general I just like the elegance of one-pass constant storage algorithms :)
Jon Skeet
Heh, I just made the same comment about sparseness on Johannes' answer. I agree about elegance, too, although I worry slightly that using a lot of random numbers makes it harder to analyse the effects of any weaknesses in the RNG.
Steve Jessop
Thanks Jon and Steve, I suspected there was some technique like this but I wasn't aware of it before!
momeara