I'm programming in R. I've got a vector containing, let's say, 1000 values. Now let's say I want to partition these 1000 values randomly into two new sets, one containing 400 values and the other containing 600. How could I do this? I've thought about doing something like this...

firstset <- sample(mydata, size=400)

...but this doesn't partition the data (in other words, I still don't know which 600 values to put in the other set). I also thought about looping from 1 to 400, randomly removing 1 value at a time and placing it in firstset. This would partition the data correctly, but how to implement this is not clear to me. Plus I've been told to avoid for loops in R whenever possible.

Any ideas?

+7  A: 

Instead of sampling the values, you could sample their positions.

positions <- sample(length(mydata), size=400)  # ucfagls' suggestion
firstset <- mydata[positions]
secondset <- mydata[-positions]

EDIT: ucfagls' suggestion (passing `length(mydata)` directly instead of a vector of indices) will be more efficient, especially for larger vectors, since it avoids allocating a vector of positions in R.
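
For anyone who wants to convince themselves that this really partitions the vector, here is a minimal sketch on made-up data. The contents of `mydata` and the `set.seed()` call are assumptions for illustration only; they are not part of the question.

```r
# Hypothetical data standing in for the asker's vector of 1000 values
set.seed(42)                      # only so the example is reproducible
mydata <- rnorm(1000)

# Sample 400 positions (not values) out of 1..length(mydata)
positions <- sample(length(mydata), size = 400)

firstset  <- mydata[positions]    # the 400 sampled elements
secondset <- mydata[-positions]   # everything else, i.e. the other 600

length(firstset)    # 400
length(secondset)   # 600

# Together the two sets contain exactly the original values
all.equal(sort(c(firstset, secondset)), sort(mydata))
```

Because the split is done on positions, it also behaves correctly when `mydata` contains duplicate values.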

Joshua Ulrich
Very cool idea. Thanks!
Daniel Standage
The first line can be simplified to `positions <- sample(length(mydata), size=400)` so you don't need to generate the vector from which to sample. The first argument is allowed to be a positive integer. Or even to `positions <- sample(mydata, size=400)`.
Gavin Simpson
Surely positions <- sample(mydata, size=400) will return actual values from mydata and not positions? You'll not be able to get the other 600. You got it right first time!
Spacedman
+4  A: 

If mydata is truly a vector, one option would be:

split(mydata, sample(c(rep("group1", 600), rep("group2", 400))))
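
A rough sketch of how the result might be used, assuming `mydata` is a plain vector of 1000 values. The stand-in data and the names `labels` and `parts` are made up for this example.

```r
set.seed(1)                       # only for reproducibility
mydata <- seq_len(1000)           # hypothetical stand-in data

# Shuffle a vector of 600 "group1" and 400 "group2" labels, then split by it
labels <- sample(rep(c("group1", "group2"), times = c(600, 400)))
parts  <- split(mydata, labels)   # a named list with one element per label

names(parts)                      # "group1" "group2"
lengths(parts)                    # group1 has 600 elements, group2 has 400

firstset  <- parts$group2         # the 400-element set
secondset <- parts$group1         # the 600-element set
```

Note that with these labels the 400-element set is the one called `group2`.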
Greg
I did not know the first argument of 'sample' could be a vector. Thanks!
Daniel Standage
Additionally, this will store both subsets of the original data in one object (list), keeping the global workspace from getting cluttered.
Greg
+3  A: 

Just shuffle mydata, then take the first 400 values and the last 600.

mydata <- sample(mydata)  # with no size given, sample() returns a random permutation (this overwrites mydata)
firstset <- mydata[1:400]
secondset <- mydata[401:1000]
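
The same approach can be sketched without hard-coding 1000 and without overwriting the original vector. The names `n_first` and `shuffled` and the stand-in data are assumptions for this example, and `n_first` is assumed to be a positive number no larger than `length(mydata)`.

```r
set.seed(7)                       # only for reproducibility
mydata  <- runif(1000)            # hypothetical stand-in data
n_first <- 400                    # size of the first set

shuffled  <- sample(mydata)             # random permutation; mydata itself is untouched
firstset  <- head(shuffled, n_first)    # the first n_first values
secondset <- tail(shuffled, -n_first)   # everything except the first n_first values

length(firstset)    # 400
length(secondset)   # 600
```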
John