ansaurus

Question

What is O value for naive random selection from finite set?

Answer 1

+2 A:

Your actual question is actually a lot more interesting than what I answered (and harder). I've never been any good at statistitcs (and it's been a while since I did any), but intuitively, I'd say that the run-time complexity of that algorithm would probably something like an exponential. As long as the number of elements picked is small enough compared to the size of the array the collision-rate will be so small that it will be close to linear time, but at some point the number of collisions will probably grow fast and the run-time will go down the drain.

If you want to prove this, I think you'd have to do something moderately clever with the expected number of collisions in function of the wanted number of elements. It might be possible do to by induction as well, but I think going by that route would require more cleverness than the first alternative.

EDIT: After giving it some thought, here's my attempt:

Given an array of m elements, and looking for n random and different elements. It is then easy to see that when we want to pick the ith element, the odds of picking an element we've already visited are (i-1)/m. This is then the expected number of collisions for that particular pick. For picking n elements, the expected number of collisions will be the sum of the number of expected collisions for each pick. We plug this into Wolfram Alpha (sum (i-1)/m, i=1 to n) and we get the answer (n**2 - n)/2m. The average number of picks for our naive algorithm is then n + (n**2 - n)/2m.

Unless my memory fails me completely (which entirely possible, actually), this gives an average-case run-time O(n**2).

arnsholt 2009-08-18 13:50:24

He specifically stated that the question was NOT how he could improve the algorithm. The question was what the complexity of the naive solution is, in a big-O sense. Please read the whole question before you answer.

Alderath 2009-08-18 13:59:59

Yeah, I realized that just after I posted my answer. That question was more interesting anyways, so I killed the original reply and replaced it with something relevant.

arnsholt 2009-08-18 14:05:55

Answer 2

+2 A:

There's a beautiful O(n) algorithm for this. It goes as follows. Say you have n items, from which you want to pick m items. I assume the function rand() yields a random real number between 0 and 1. Here's the algorithm:

items_left=n
items_left_to_pick=m
for j=1,...,n
    if rand()<=(items_left_to_pick/items_left)
        Pick item j
        items_left_to_pick=items_left_to_pick-1
    end
    items_left=items_left-1
end

It can be proved that this algorithm does indeed pick each subset of m items with equal probability, though the proof is non-obvious. Unfortunately, I don't have a reference handy at the moment.

Edit The advantage of this algorithm is that it takes only O(m) memory (assuming the items are simply integers or can be generated on-the-fly) compared to doing a shuffle, which takes O(n) memory.

Martin B 2009-08-18 13:51:15

Asker is not asking for a better way, but for a characterization of the time complexity of the current (bad) approach. btw, a proof of correctness of this (good) approach is in Knuth, I think.

AakashM 2009-08-18 14:02:25

You're right -- I was shooting from the hip there. Apologies...

Martin B 2009-08-18 15:07:52

Answer 3

+3 A:

If you're willing to make the assumption that your random number generator will always find a unique value before cycling back to a previously seen value for a given draw, this algorithm is O(m^2), where m is the number of unique values you are drawing.

So, if you are drawing m values from a set of n values, the 1st value will require you to draw at most 1 to get a unique value. The 2nd requires at most 2 (you see the 1st value, then a unique value), the 3rd 3, ... the mth m. Hence in total you require 1 + 2 + 3 + ... + m = [m*(m+1)]/2 = (m^2 + m)/2 draws. This is O(m^2).

Without this assumption, I'm not sure how you can even guarantee the algorithm will complete. It's quite possible (especially with a pseudo-random number generator which may have a cycle), that you will keep seeing the same values over and over and never get to another unique value.

==EDIT==

For the average case:

On your first draw, you will make exactly 1 draw. On your 2nd draw, you expect to make 1 (the successful draw) + 1/n (the "partial" draw which represents your chance of drawing a repeat) On your 3rd draw, you expect to make 1 (the successful draw) + 2/n (the "partial" draw...) ... On your mth draw, you expect to make 1 + (m-1)/n draws.

Thus, you will make 1 + (1 + 1/n) + (1 + 2/n) + ... + (1 + (m-1)/n) draws altogether in the average case.

This equals the sum from i=0 to (m-1) of [1 + i/n]. Let's denote that sum(1 + i/n, i, 0, m-1).

Then:

sum(1 + i/n, i, 0, m-1) = sum(1, i, 0, m-1) + sum(i/n, i, 0, m-1)
                        = m + sum(i/n, i, 0, m-1)
                        = m + (1/n) * sum(i, i, 0, m-1)
                        = m + (1/n)*[(m-1)*m]/2
                        = (m^2)/(2n) - (m)/(2n) + m

We drop the low order terms and the constants, and we get that this is O(m^2/n), where m is the number to be drawn and n is the size of the list.

Ambuoroko 2009-08-18 14:02:15

I'm not sure I understand the first paragraph. Why are you saying O(m^2)? I don't see the logic for it.

Beska 2009-08-18 14:29:20

If you take 1 + 2 + 3 + ... + m (sum up all the numbers from 1 to m), you get [m*(m+1)]/2 = (m^2 + m)/2. So when you drop out all the constants and smaller-order terms, you get O(m^2)

Ambuoroko 2009-08-18 14:32:04

Additionally, I don't think that the first assumption is valid. If you're excluding numbers, then it's not truly random. Random numbers could, conceivably, include a series of just 9's (just like the Dilbert cartoon.) To say that hitting a number twice before getting the correct one is not allowed avoids the problem. Just because a number has been hit twice, doesn't mean that the generator has started repeating a cycle. In particular, I'm concerned with an ideal generator, with no cycling.

Beska 2009-08-18 14:35:53

It's definitely true that there is no guarantee that the algorithm will complete. But in the average case, it will, which is the more interesting case.

Beska 2009-08-18 14:36:51

I am just about finished with a proof of the average-case behaviour. I will be editing my answer to include it once I convince myself it's correct. :)

Ambuoroko 2009-08-18 14:38:47

If the random generator only generates unique numbers, it should be O(n). The second number generated is guaranteed to not be the same as the previous, so your sequence of possible choices would be 1 + 1 + 1 + ... + 1 m times. The interesting part of the question, however, was the fact that collisions may occur at random, and that choosing "999,999 values from a set of 1,000,000" will result in a very long running time.

Alderath 2009-08-18 14:39:06

Alderath -- I was originally assuming that the random number generator generated unique numbers for a particular stage of the problem. Thus it might pick "9" for the first number, then "9" for the 2nd number, but after picking "9" for the second number you could not see "9" again until the 3rd number, etc.Otherwise there was a practical issue with an infinite loop. See my edited response now for an average-case run time analysis.

Ambuoroko 2009-08-18 15:15:25

This seems correct to me. I don't see anything wrong with the math.

Beska 2009-08-18 15:37:24

I like this proof. Essentially the same as my argument above, but stated a lot more clearly. Incidentally, I have to admit that I actually expected the average case to be worse than n squared.

arnsholt 2009-08-18 16:21:09

Unfortunately, this answer is also wrong -- the 1 + i/N term is incorrect. Think it through -- when i = N, you'll make quite a few more than the 2 draws that that formula predicts. See Accipitridae's answer for the correct solution.

Richard Dunlap 2009-08-18 19:39:32

It is not the solution itself that is wrong. It is the assumption that the random generator will work in such a way that, in each step it may generate items which are equal to previously selected items ONLY ONCE that is wrong. This has a crucial impact on the result. The question was about an ideal random generator, and the result in this answer is wrong in that case. Secondly, if you somehow have a random nr generator that works this way, it makes no sense to allow collisions at all. That type of generator could easily spawn non-colliding resulting in a time complexity of 1+1+1+...+1 = m

Alderath 2009-08-18 21:47:37

I do however like the comment stating that if bad pseudo random number generators are used, the naive algorithm might loop infinitely, if m and n are almost equal. That point is relevant to the question, even if the rest of the analysis is wrong.

Alderath 2009-08-18 21:55:46

For the time being, I'm withdrawing "accepted answer" from this response until I can evaluate the other answers in light of these comments.

Beska 2009-08-19 00:02:27

(I'm also hoping I'm not out of my league, math-wise. I'll do my best to evaluate everything correctly.)

Beska 2009-08-19 00:04:01

Answer 4

+3 A:

If you already have chosen i values then the probability that you pick a new one from a set of y values is

(y-i)/y.

Hence the expected number of trials to get (i+1)-th element is

y/(y-i).

Thus the expected number of trials to choose x unique element is the sum

 y/y + y/(y-1) + ... + y/(y-x+1)

This can be expressed using harmonic numbers as

y (H_y - H_y-x).

From the wikipedia page you get the approximation

H_x = ln(x) + gamma + O(1/x)

Hence the number of necessary trials to pick x unique elements from a set of y elements is

y (ln(y) - ln(y-x)) + O(y/(y-x)).

If you need then you can get a more precise approximation by using a more precise approximation for H_x. In particular, when x is small it is possible to improve the result a lot.

Accipitridae 2009-08-18 17:40:06

You are completely correct. My comment was based upon a stupid mistake by me misreading the brackets as y (ln(y) - ln(y-x) + O(y/(y-x))). I'm considering removing my previous comment in order to not incorrectly discredit your result, which is in fact correct. Should I do this?

Alderath 2009-08-19 07:29:48

Alderath, feel free to change or delete any comments you are not comfortable with. My last section only shows that the result does not contradict your estimates. Since the results are not strong I'm also considering to remove or replace this section. After all the OP was mainly interested in the case where x is large.

Accipitridae 2009-08-19 23:24:16

Answer 5

+4 A:

Variables

n = the total amount of items in the set
m = the amount of unique values that are to be retrieved from the set of n items
d(i) = the expected amount of tries needed to achieve a value in step i
i = denotes one specific step. i ∈ [0, n-1]
T(m,n) = expected total amount of tries for selecting m unique items from a set of n items using the naive algorithm

Reasoning

The first step, i=0, is trivial. No matter which value we choose, we get a unique one at the first attempt. Hence:

d(0) = 1

In the second step, i=1, we at least need 1 try (the try where we pick a valid unique value). On top of this, there is a chance that we choose the wrong value. This chance is (amount of previously picked items)/(total amount of items). In this case 1/n. In the case where we picked the wrong item, there is a 1/n chance we may pick the wrong item again. Multiplying this by 1/n, since that is the combined probability that we pick wrong both times, gives (1/n)². To understand this, it is helpful to draw a decision tree. Having picked a non-unique item twice, there is a probability that we will do it again. This results in the addition of (1/n)³ to the total expected amounts of tries in step i=1. Each time we pick the wrong number, there is a chance we might pick the wrong number again. This results in:

d(1) = 1 + 1/n + (1/n)² + (1/n)³ + (1/n)⁴ + ...

Similarly, in the general i:th step, the chance to pick the wrong item in one choice is i/n, resulting in:

d(i) = 1 + i/n + (i/n)² + (i/n)³ + (i/n)⁴ + ... =
= sum( (i/n)^k ), where k ∈ [0,∞]

This is a geometric sequence and hence it is easy to compute it's sum:

d(i) = (1 - i/n)^-1

The overall complexity is then computed by summing the expected amount of tries in each step:

T(m,n) = sum ( d(i) ), where i ∈ [0,m-1] =
= 1 + (1 - 1/n)^-1 + (1 - 2/n)^-1 + (1 - 3/n)^-1 + ... + (1 - (m-1)/n)^-1

Extending the fractions in the series above by n, we get:

T(m,n) = n/n + n/(n-1) + n/(n-2) + n/(n-3) + ... + n/(n-m+2) + n/(n-m+1)

We can use the fact that:

n/n ≤ n/(n-1) ≤ n/(n-2) ≤ n/(n-3) ≤ ... ≤ n/(n-m+2) ≤ n/(n-m+1)

Since the series has m terms, and each term satisfies the inequality above, we get:

T(m,n) ≤ n/(n-m+1) + n/(n-m+1) + n/(n-m+1) + n/(n-m+1) + ... + n/(n-m+1) + n/(n-m+1) =
= m*n/(n-m+1)

It might be(and probably is) possible to establish a slightly stricter upper bound by using some technique to evaluate the series instead of bounding by the rough method of (amount of terms) * (biggest term)

Conclusion

This would mean that the Big-O order is O(m*n/(n-m+1)). I see no possible way to simplify this expression from the way it is.

Looking back at the result to check if it makes sense, we see that, if n is constant, and m gets closer and closer to n, the results will quickly increase, since the denominator gets very small. This is what we'd expect, if we for example consider the example given in the question about selecting "999,999 values from a set of 1,000,000". If we instead let m be constant and n grow really, really large, the complexity will converge towards O(m) in the limit n → ∞. This is also what we'd expect, since while chosing a constant number of items from a "close to" infinitely sized set the probability of choosing a previously chosen value is basically 0. I.e. We need m tries independently of n since there are no collisions.

Alderath 2009-08-18 20:36:22

Big-O isn't quite what you've computed; that describes the growth of the worst-case. In the worst case, a pessimal random-number generator is't very random and you lose. Random algorithms have to be described by average/expected case behavior, and you've done that splendidly.

Ian 2010-09-05 04:54:58

Answer 6

A:

Before being able to answer this question in details, lets define the framework. Suppose you have a collection {a1, a2, ..., an} of n distinct objects, and want to pick m distinct objects from this set, such that the probability of a given object aj appearing in the result is equal for all objects.

If you have already picked k items, and radomly pick an item from the full set {a1, a2, ..., an}, the probability that the item has not been picked before is (n-k)/n. This means that the number of samples you have to take before you get a new object is (assuming independence of random sampling) geometric with parameter (n-k)/n. Thus the expected number of samples to obtain one extra item is n/(n-k), which is close to 1 if k is small compared to n.

Concluding, if you need m unique objects, randomly selected, this algorithm gives you

n/n + n/(n-1) + n/(n-2) + n/(n-3) + .... + n/(n-(m-1))

which, as Alderath showed, can be estimated by

m*n / (n-m+1).

You can see a little bit more from this formula: * The expected number of samples to obtain a new unique element increases as the number of already chosen objects increases (which sounds logical). * You can expect really long computation times when m is close to n, especially if n is large.

In order to obtain m unique members from the set, use a variant of David Knuth's algorithm for obtaining a random permutation. Here, I'll assume that the n objects are stored in an array.

for i = 1..m
  k = randInt(i, n)
  exchange(i, k)
end

here, randInt samples an integer from {i, i+1, ... n}, and exchange flips two members of the array. You only need to shuffle m times, so the computation time is O(m), whereas the memory is O(n) (although you can adapt it to only save the entries such that a[i] <> i, which would give you O(m) on both time and memory, but with higher constants).

Martijn 2009-08-24 12:15:40

Answer 7

+1 A:

I. J. Kennedy 2009-10-20 22:09:09

Great, this is exactly an answer you can relate to. ''NlogN'' is a quite good upper bound, when ''M'' is close to ''N'', which is of cource the interesting/worst case situation.

Thomas Ahle 2010-01-30 21:19:15

Answer 8

A:

Most people forget that looking up, if the number has already run, also takes a while.

The number of tries nessesary can, as descriped earlier, be evaluated from:

T(n,m) = n(H(n)-H(n-m)) ⪅ n(ln(n)-ln(n-m))

which goes to n*ln(n) for interesting values of m

However, for each of these 'tries' you will have to do a lookup. This might be a simple O(n) runthrough, or something like a binary tree. This will give you a total performance of n^2*ln(n) or n*ln(n)^2.

For smaller values of m (m < n/2), you can do a very good approximation for T(n,m) using the HA-inequation, yielding the formula:

2*m*n/(2*n-m+1)

As m goes to n, this gives a lower bound of O(n) tries and performance O(n^2) or O(n*ln(n)).

All the results are however far better, that I would ever have expected, which shows that the algorithm might actually be just fine in many non critical cases, where you can accept occasional longer running times (when you are unlucky).

Thomas Ahle 2010-01-30 22:01:03

ansaurus

tags:

views:

answers:

What is O value for naive random selection from finite set?

Variables

Reasoning

Conclusion

related questions