views:

2784

answers:

7

Say I have a linked list of numbers of length N. N is very large and I don’t know in advance the exact value of N.

How can I most efficiently write a function that will return k completely random numbers from the list?

+2  A: 

I would suggest: First find your k random numbers. Sort them. Then traverse both the linked list and your random numbers once.

If you somehow don't know the length of your linked list (how?), then you could grab the first k into an array, then for node r, generate a random number in [0, r), and if that is less than k, replace the rth item of the array. (Not entirely convinced that doesn't bias...)

Other than that: "If I were you, I wouldn't be starting from here." Are you sure linked list is right for your problem? Is there not a better data structure, such as a good old flat array list.

Tom Hawtin - tackline
Hi Tom - Sorry, I don't quite understand - What do you mean by "First find your k random numbers". The objective is to find the k random numbers taken from the list - or am I misunderstanding something.
Matt Sheppard
A: 

Well, you do need to know what N is at runtime at least, even if this involves doing an extra pass over the list to count them. The simplest algorithm to do this is to just pick a random number in N and remove that item, repeated k times. Or, if it is permissible to return repeat numbers, don't remove the item.

Unless you have a VERY large N, and very stringent performance requirements, this algorithm runs with O(N*k) complexity, which should be acceptable.

Edit: Nevermind, Tom Hawtin's method is way better. Select the random numbers first, then traverse the list once. Same theoretical complexity, I think, but much better expected runtime.

Christian Oudard
A: 

Why can't you just do something like

List GetKRandomFromList(List input, int k)
  List ret = new List();
  for(i=0;i<k;i++)
    ret.Add(input[Math.Rand(0,input.Length)]);
  return ret;

I'm sure that you don't mean something that simple so can you specify further?

George Mauer
+11  A: 

This is called a Reservoir Sampling problem. The simple solution is to assign a random number to each element of the list as you see it, then keep the top (or bottom) k elements as ordered by the random number.

Bill the Lizard
@Bill: Any idea why this is called **reservoir** sampling?
Lazer
@Lazer: I'm not sure if [Vitter](http://www.cs.umd.edu/~samir/498/vitter.pdf) (PDF) coined the term originally, but he phrases the problem as "selecting a random sample of n records without replacement from a **pool** of N records..." (emphasis mine). He gives a definition in section 2 that starts out 'The first step of any reservoir algorithm is to put the first n records of the file into a “reservoir.”...'
Bill the Lizard
thanks, @Bill .
Lazer
A: 

If you don't know the length of the list, then you will have to traverse it complete to ensure random picks. The method I've used in this case is the one described by Tom Hawtin (54070). While traversing the list you keep k elements that form your random selection to that point. (Initially you just add the first k elements you encounter.) Then, with probability k/i, you replace a random element from your selection with the ith element of the list (i.e. the element you are at, at that moment).

It's easy to show that this gives a random selection. After seeing m elements (m > k), we have that each of the first m elements of the list are part of you random selection with a probability k/m. That this initially holds is trivial. Then for each element m+1, you put it in your selection (replacing a random element) with probability k/(m+1). You now need to show that all other elements also have probability k/(m+1) of being selected. We have that the probability is k/m * (k/(m+1)*(1-1/k) + (1-k/(m+1))) (i.e. probability that element was in the list times the probability that it is still there). With calculus you can straightforwardly show that this is equal to k/(m+1).

mweerden
A: 

It is the Reservoir Sampling algorithm, however maths and algorithm explanation in the gregable blog is a bit convoluted and hard to follow. I worked out the maths directly and tried to explain how the algorithm works more cleanly and clearly here:

http://code-slim-jim.blogspot.com/2010/06/reservoir-sampling.html

Ashley