views:

174

answers:

10

Let's say that I have a list of data: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} where n = 10 elements

I'd like to randomly choose k elements of this set to form a sublist, say k = 5.

In that case, I could end up with a sublist that looks like {9, 3, 5, 2, 7}

I could accomplish this by:

  • Randomly determining an offset within the list, between 0 and the current size of the list minus 1
  • Appending that element to my sublist
  • Erasing that element from the original list
  • Repeat until the desired size is found

The problem with this is that as the original list grows the offset and deletion time grows as well, and for any significantly large list (say over 1,000,000 elements), it takes quite a long time to perform this algorithm.

Is there a faster way to generate a random sequence from a list of given data? The implementation of the random number generator should be set aside for this problem, instead, focusing on how the RNG result is used in a proposed algorithm.

Any thoughts?

Right now I'm using the C++ STL list

A: 

You could shuffle it with std::random_shuffle and then just copy the first however many elements you want into a new list.

Cogwheel - Matthew Orlando
+7  A: 

I would use random_shuffle. You can change the generator by supplying a third parameter.

It requires random access iterators, so you can either switch to a std::vector (which is generally far superior and preferred over std::list, arguably the worse container), or just operate on some array. I'll demonstrate both:

int data[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
std::random_shuffle(data, data + 10); 

// or

std::vector data; // populate it
std::random_shuffle(data.begin(), data.end());

Now everything is in random order, just treat the fist k elements as your subset:

// now treat data[0] through data[k] as your random subset, or:
std::vector subset(data, data + k);

// or
data.resize(k); // shrink vector

Note that in another question, Jerry shares an excellent way of doing what you want.

GMan
Then just read off the first `n` elements from this shuffled list, and then add them to your new list.
sigint
If `k` is much smaller than `n`, then this is doing more work than necessary.
Mike Seymour
@Mike: Good point. I've added a link to Jerry's answer in another question since it seems to be the superior method.
GMan
+1  A: 

Shuffle the list, then take the first (or last) k elements. If you use a O(n) algorithm like the Fisher-Yates shuffle, then the whole process is O(n).

Nick Meyer
A: 

Shuffle your array using some algorithm Then you can peek random elements from the beginning of array.

spektom
+1  A: 

Or you could accomplish this by:

  • Randomly determining an offset within the list, between 0 and the current size of the list.
  • Appending that element to your sublist.
  • Repeat until the sublist is probably long enough to contain the right number of elements. For example, if you are choosing 10 out of 1,000,000 elements a sublist of 10 is probably long enough. You don't need to be hyper-accurate in your calculation of what number of extra elements you have to choose
  • Now check that all elements in the sublist are different. If not, delete the duplicates. If your sublist is now too short choose some more from the main list. If not, you're done.

I'm not sure why you want to delete the chosen elements from the main list, but if that is essential you could do it after constructing the sublist.

And I haven't a clue about how the performance of this approach will rate against the performance of the of the suggested random_shuffle of a list of 10^6 elements.

High Performance Mark
This is the bogo sort of random sampling. If k <<< n, the performance might be okay, but as k gets larger, the performance will degenerate quickly. Plus, it requires the original list be made up of unique elements.
Dennis Zickefoose
+1  A: 

A minimal example using OutputIterators and std::random_shuffle. Notice that the algorithm will modify your original input, so it could be reasonable to make a copy before you call the function.

#include <iostream>
#include <algorithm>
#include <vector>
#include <iterator>

template<class It, class OutIt>
void take_random_n(It begin, It end, OutIt out, size_t n) {
  std::random_shuffle(begin, end);
  It end2 = begin;
  std::advance(end2, n);
  std::copy(begin, end2, out);
}

int main() {
  std::vector<int> a;
  int b[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
  take_random_n(b, b + 10, std::back_inserter(a), 4);
  for(std::vector<int>::iterator it = a.begin(); it != a.end(); ++it)
    std::cout << *it << " ";
}
pmr
You should note that the iterators must be random access iterators. Usually, this is somewhat documented in the template parameter list using a name like, well, RandomAccessIterator.
Luc Touraille
+4  A: 

http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#The_modern_algorithm

Look under Examples > Modern method

You don't need to shuffle your entire list. O(k) (better than O(n))

Tom Sirgedas
A: 

Assign a random number to each entry in your list, then sort the list by random number. Pick off the first n entries you want.

A: 

Most answers propose to shuffle the initial container. If you don't want it to be modified, you can still use this approach, but you first need to copy the container. The solution of @pmr (which is nice because he makes it into a function) would then become:

template <typename InputIterator, typename Size, typename OutputIterator>
void take_random_n(InputIterator first, InputIterator  last, 
                   Size          n,     OutputIterator result)
{
    typedef typename std::iterator_traits<InputIterator>::value_type value_type;

    std::vector<value_type> shufflingVec(first, last);

    std::random_shuffle(shufflingVec.begin(), shufflingVec.end());

    std::copy(shufflingVec.begin(), shufflingVec.begin() + n, result);
}

However, copying the entire container can be quite expensive if the elements contained are heavy and take some time to copy. In this case, you can be better off shuffling a list of indexes:

template <typename InputIterator, typename Size, typename OutputIterator>
void take_random_n(InputIterator first, InputIterator  last, 
                   Size          n,     OutputIterator result)
{
    typedef typename 
        std::iterator_traits<InputIterator>::value_type      value_type;
    typedef typename 
        std::iterator_traits<InputIterator>::difference_type difference_type;

    difference_type size = std::distance(first, last);

    std::vector<value_type> indexesVec(
        boost::counting_iterator<size_t>(0),
        boost::counting_iterator<size_t>(size));

    // counting_iterator generates incrementing numbers. Easy to implement if you
    // can't use Boost

    std::random_shuffle(indexesVec.begin(), indexesVec.end());

    for (Size i = 0 ; i < n ; ++i)
    {
        *result++ = *std::advance(first, indexesVec[i]);
    }
}

// Disclaimer: I have not tested the code above!

You'll notice that the latter solution will perform very differently depending on the kind of iterators you use: with random access iterators (like pointers or vector<T>::iterator), it will be ok, but with other types of iterators, the use of std::distance and the numerous calls to std::advance can induce quite an overhead.

Luc Touraille
A: 

My 2 cents (using stl only & needing at most forward iterators):

//-----------------------------------------------------------------------------
#include <cstdlib>
//-----------------------------------------------------------------------------
#include <iostream>
#include <list>
#include <iterator>
#include <algorithm>
//-----------------------------------------------------------------------------
// random generator
template< typename DiffType >
struct RandomlyRandom{
  DiffType operator()( DiffType i ){
    return std::rand() % i;
  }
};
//-----------------------------------------------------------------------------
// we'll have two iterators:
//  - the first starts at the begining of the range
// and moves one element at a time for n times
//  - the second starts at random in the middle of the range
// and will move a random number of elements inside the range
//
// then we swap their values
template< typename FwdIter, typename Fn >
void random_shuffle_n( FwdIter begin, FwdIter end, Fn& Func, size_t n ){
typedef typename std::iterator_traits<FwdIter>::difference_type difference_type;

FwdIter first = begin;
FwdIter second = begin;

difference_type dist  = std::distance( begin, end );
difference_type offset = Func( dist ) % dist;
difference_type index = offset;
std::advance( second, offset ); // try to put some distance between first & second

  do{
    offset = Func( dist ) % dist;
    index += offset;
    if( index >= dist ){
      second = begin;
      index = offset = index % dist;
    }
    std::advance( second, offset );

    std::swap( *first++, *second );
  }while( n-- > 0 );
}
//-----------------------------------------------------------------------------
int main( int argc, char* argv[] ){
int arr[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
std::list< int > lst( arr, arr + sizeof( arr ) / sizeof( arr[ 0 ] ) );

  std::copy( lst.begin(), lst.end(), std::ostream_iterator< int >( std::cout, " " ) ); 
  std::cout << std::endl;
  RandomlyRandom< std::list< int >::difference_type > rand;

  for( int i = 0; i < 100;  i++ ){
    random_shuffle_n( lst.begin(), lst.end(), rand, 5 );
    std::copy( lst.begin(), lst.end(), std::ostream_iterator< int >( std::cout, " " ) ); 
    std::cout << std::endl;
  }

  return 0;
}
//-----------------------------------------------------------------------------
Eugen Constantin Dinca