views:

12195

answers:

12

I need a quick algorithm to select a random 5 elements from a generic list. For example, I'd like to get a random 5 elements from a List.

+27  A: 

Iterate through and for each element make the probability of selection = (number needed)/(number left)

So if you had 40 items, the first would have a 5/40 chance of being selected. If it is, the next has a 4/39 chance, otherwise it has a 5/39 chance. By the time you get to the end you will have your 5 items, and often you'll have all of them before that.

Kyle Cronin
elegant! I was wondering if there is a way that we can make this algorithm work with weighted elements?
nimcap
I feel this is subtlely wrong. It seems like the back end of the list will get picked more often than the front end as the back end will see much larger probabilities. For example, if the first 35 numbers do not get picked, the last 5 numbers have to get picked. The first number will only ever see a 5/40 chance, but that last number will see 1/1 more often than 5/40 times.You will have to randomize the list first before you implement this algorithm.
Ankur Goel
ok, I ran this algorithm 10 million times on a list of 40 elements, each with a 5/40 (.125) shot at getting selected, and then ran that simulation several times. It turns out that this is not evenly distributed. Elements 16 thru 22 get underselected (16 = .123, 17 = .124), while element 34 gets overselected (34 =.129). Elements 39 and 40 also get underselected but not by as much (39 = .1247, 40 = .1246)
Ankur Goel
@Ankur: I don't believe that's statistically significant. I believe there is an inductive proof that this will provide an even distribution.
recursive
I've repeated the same trial 100 million times, and in my trial the least chosen item was chosen less than 0.106% less frequently than the most frequently chosen item.
recursive
I see Kyle's proof below. And I can't say I disagree. But run the same 100 million trial several times, and see if you see same distribution across trials.
Ankur Goel
+11  A: 

This is actually a harder problem than it sounds like, mainly because many mathematically-correct solutions will fail to actually allow you to hit all the possibilities (more on this below).

First, here are some easy-to-implement, correct-if-you-have-a-truly-random-number generator:

(0) Kyle's answer, which is O(n).

(1) Generate a list of n pairs [(0, rand), (1, rand), (2, rand), ...], sort them by the second coordinate, and use the first k (for you, k=5) indices to get your random subset. I think this is easy to implement, although it is O(n log n) time.

(2) Init an empty list s = [] that will grow to be the indices of k random elements. Choose a number r in {0, 1, 2, ..., n-1} at random, r = rand % n, and add this to s. Next take r = rand % (n-1) and stick in s; add to r the # elements less than it in s to avoid collisions. Next take r = rand % (n-2), and do the same thing, etc. until you have k distinct elements in s. This has worst-case running time O(k^2). So for k << n, this can be faster. If you keep s sorted and track which contiguous intervals it has, you can implement it in O(k log k), but it's more work.

@Kyle - you're right, on second thought I agree with your answer. I hastily read it at first, and mistakenly thought you were indicating to sequentially choose each element with fixed probability k/n, which would have been wrong - but your adaptive approach appears correct to me. Sorry about that.

Ok, and now for the kicker: asymptotically (for fixed k, n growing), there are n^k/k! choices of k element subset out of n elements [this is an approximation of (n choose k)]. If n is large, and k is not very small, then these numbers are huge. The best cycle length you can hope for in any standard 32 bit random number generator is 2^32 = 256^4. So if we have a list of 1000 elements, and we want to choose 5 at random, there's no way a standard random number generator will hit all the possibilities. However, as long as you're ok with a choice that works fine for smaller sets, and always "looks" random, then these algorithms should be ok.

Addendum: After writing this, I realized that it's tricky to implement idea (2) correctly, so I wanted to clarify this answer. To get O(k log k) time, you need an array-like structure that supports O(log m) searches and inserts - a balanced binary tree can do this. Using such a structure to build up an array called s, here is some pseudopython:

# Returns a container s with k distinct random numbers from {0, 1, ..., n-1}
def ChooseRandomSubset(n, k):
  for i in range(k):
    r = UniformRandom(0, n-i)                 # May be 0, must be < n-i
    q = s.FirstIndexSuchThat( s[q] - q > r )  # This is the search.
    s.InsertInOrder(q ? r + q : r + len(s))   # Inserts right before q.
  return s

I suggest running through a few sample cases to see how this efficiently implements the above English explanation.

Tyler
for (1) you can shuffle a list faster than sorting is, for (2) you will be biasing your distribution by using %
jk
+1  A: 

@Tyler

I may not be up on my number theory, but I still don't understand why my algorithm would produce biased results. Maybe my hand-done test trials use numbers too small (or my math is in error) in which case I'd welcome some input.

Select 1 of {a, b, c}:

P(a) = 1/3

P(b) = (2/3)*(1/2) = 1/3

P(c) = (2/3)*(1/2)*(1/1) = 1/3

Select 2 of {a, b, c, d}

P(ab) = (1/2)*(1/3) = 1/6

P(ac) = (1/2)*(2/3)*(1/2) = 1/6

P(ad) = (1/2)*(2/3)*(1/2)*(1/1) = 1/6

P(bc) = (1/2)*(2/3)*(1/2) = 1/6

P(bd) = (1/2)*(2/3)*(1/2)*(1/1) = 1/6

P(cd) = (1/2)*(1/3)*(1/1)*(1/1) = 1/6

Kyle Cronin
A: 

This is the best I could come up with on a first cut:

public List<String> getRandomItemsFromList(int returnCount, List<String> list)
{
    List<String> returnList = new List<String>();
    Dictionary<int, int> randoms = new Dictionary<int, int>();

    while (randoms.Count != returnCount)
    {
        //generate new random between one and total list count
        int randomInt = new Random().Next(list.Count);

        // store this in dictionary to ensure uniqueness
        try
        {
            randoms.Add(randomInt, randomInt);
        }
        catch (ArgumentException aex)
        {
            Console.Write(aex.Message);
        } //we can assume this element exists in the dictonary already 

        //check for randoms length and then iterate through the original list 
        //adding items we select via random to the return list
        if (randoms.Count == returnCount)
        {
            foreach (int key in randoms.Keys)
                returnList.Add(list[randoms[key]]);

            break; //break out of _while_ loop
        }
    }

    return returnList;
}

Using a list of randoms within a range of 1 - total list count and then simply pulling those items in the list seemed to be the best way, but using the Dictionary to ensure uniqueness is something I'm still mulling over.

Also note I used a string list, replace as needed.

Ian S
+2  A: 

From Dragons in the Algorithm, an interpretation in C#:

int k = 10; // items to select
var items = new List<int>(new[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 });
var selected = new List<int>();
var needed = k;
var available = items.Count;
var rand = new Random();
while (selected.Count < k) {
   if( rand.NextDouble() < needed / available ) {
      selected.Add(items[available-1])
      needed--;
   }
   available--;
}

This algorithm will select unique indicies of the items list.

spoulson
A: 

I recently did this on my project using an idea similar to Tyler's point 1.
I was loading a bunch of questions and selecting five at random. Sorting was achieved using an IComparer.
aAll questions were loaded in the a QuestionSorter list, which was then sorted using the List's Sort function and the first k elements where selected.

    private class QuestionSorter : IComparable<QuestionSorter>
    {
        public double SortingKey
        {
            get;
            set;
        }

        public Question QuestionObject
        {
            get;
            set;
        }

        public QuestionSorter(Question q)
        {
            this.SortingKey = RandomNumberGenerator.RandomDouble;
            this.QuestionObject = q;
        }

        public int CompareTo(QuestionSorter other)
        {
            if (this.SortingKey < other.SortingKey)
            {
                return -1;
            }
            else if (this.SortingKey > other.SortingKey)
            {
                return 1;
            }
            else
            {
                return 0;
            }
        }
    }

Usage:

 List<QuestionSorter> unsortedQuestions = new List<QuestionSorter>();

 // add the questions here

 unsortedQuestions.Sort(unsortedQuestions as IComparer<QuestionSorter>);

 // select the first k elements
hitec
A: 

why not something like this:

 Dim ar As New ArrayList
    Dim numToGet As Integer = 5
    'hard code just to test
    ar.Add("12")
    ar.Add("11")
    ar.Add("10")
    ar.Add("15")
    ar.Add("16")
    ar.Add("17")

    Dim randomListOfProductIds As New ArrayList

    Dim toAdd As String = ""
    For i = 0 To numToGet - 1
        toAdd = ar(CInt((ar.Count - 1) * Rnd()))

        randomListOfProductIds.Add(toAdd)
        'remove from id list
        ar.Remove(toAdd)

    Next
'sorry i'm lazy and have to write vb at work :( and didn't feel like converting to c#
cellis
+8  A: 

I think the selected answer is correct and pretty sweet. I implemented it differently though, as I also wanted the result in random order.

    static IEnumerable<SomeType> PickSomeInRandomOrder<SomeType>(
        IEnumerable<SomeType> someTypes,
        int maxCount)
    {
        Random random = new Random(DateTime.Now.Millisecond);

        Dictionary<double, SomeType> randomSortTable = new Dictionary<double,SomeType>();

        foreach(SomeType someType in someTypes)
            randomSortTable[random.NextDouble()] = someType;

        return randomSortTable.OrderBy(KVP => KVP.Key).Take(maxCount).Select(KVP => KVP.Value);
    }
Frank Schwieterman
AWESOME! Really helped me out!
Atømix
Do you have any reason not to use new Random() which is based on Environment.TickCount vs. DateTime.Now.Millisecond?
lasseespeholt
No, just wasn't aware that default existed.
Frank Schwieterman
An inprovement of the randomSortTable: randomSortTable = someTypes.ToDictionary(x => random.NextDouble(), y => y); Saves the foreach loop.
Keltex
Ahh so thats how ToDictionary() works. Thanks
Frank Schwieterman
+1  A: 

It is a lot harder than one would think. See the great Article "Shuffling" from Jeff.

I did write a very short article on that subject including C# code:
Return random subset of N elements of a given array

Tobias Hertkorn
+14  A: 

Using linq:

YourList.OrderBy(x => rnd.Next()).Take(5)
ersin
+1 But if two elements gets the same number from rnd.Next() or similar then the first will be selected and the second will possibly not (if no more elements is needed). It is properly random enough depending on usage, though.
lasseespeholt
A: 

The simple solution I use (probably not good for large lists): Copy the list into temporary list, then in loop randomly select Item from temp list and put it in selected items list while removing it form temp list (so it can't be reselected).

Example:

List<Object> temp = OriginalList.ToList();
List<Object> selectedItems = new List<Object>();
Random rnd = new Random();
Object o;
int i = 0;
while (i < NumberOfSelectedItems)
{
            o = temp[rnd.Next(temp.Count)];
            selectedItems.Add(o);
            temp.Remove(o);
            i++;
 }
smoke4fun
A: 

Here's my approach (full text here http://krkadev.blogspot.com/2010/08/random-numbers-without-repetition.html ).

It should run in O(K) instead of O(N), where K is the number of wanted elements and N is the size of the list to choose from:

public <T> List<T> take(List<T> source, int k) {
 int n = source.size();
 if (k > n) {
   throw new IllegalStateException(
     "Can not take " + k +
     " elements from a list with " + n +
     " elements");
 }
 List<T> result = new ArrayList<T>(k);
 Map<Integer,Integer> used = new HashMap<Integer,Integer>();
 int metric = 0;
 for (int i = 0; i < k; i++) {
   int off = random.nextInt(n - i);
   while (true) {
     metric++;
     Integer redirect = used.put(off, n - i - 1);
     if (redirect == null) {
       break;
     }
     off = redirect;
   }
   result.add(source.get(off));
 }
 assert metric <= 2*k;
 return result;
}
Kristofer