views: 986
answers: 2

I know that R works most efficiently with vectors and that looping should be avoided, but I am having a hard time teaching myself to actually write code that way. I would like some ideas on how to 'vectorize' my code. Here's an example of creating 10 years of sample data for 10,000 non-unique combinations of state, plan1, and plan2:

st<-NULL
p1<-NULL
p2<-NULL
year<-NULL
i<-0
starttime <- Sys.time()

while (i < 10000) {
    for (years in seq(1991, 2000)) {
        st <- c(st, sample(c(12,17,24), 1, prob=c(20,30,50)))
        p1 <- c(p1, sample(c(12,17,24), 1, prob=c(20,30,50)))
        p2 <- c(p2, sample(c(12,17,24), 1, prob=c(20,30,50)))
        year <- c(year, years)
    }
    i <- i + 1
}
Sys.time() - starttime

This takes about 8 minutes to run on my laptop. I end up with 4 vectors, each with 100,000 values, as expected. How can I do this faster using vector functions?

As a side note, if I limit the above code to 1000 loops on i it only takes 2 seconds, but 10,000 takes 8 minutes. Any idea why?

+3  A: 

Clearly I should have worked on this for another hour before I posted my question. It's so obvious in retrospect. :)

To use R's vector logic I took out the loop and replaced it with this:

st   <- sample(c(12,17,24), 100000, prob=c(20,30,50), replace=TRUE)
p1   <- sample(c(12,17,24), 100000, prob=c(20,30,50), replace=TRUE)
p2   <- sample(c(12,17,24), 100000, prob=c(20,30,50), replace=TRUE)
year <- rep(1991:2000, 10000)
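
For completeness, a minimal sketch of gathering the four vectors into a single data frame; the column names here are just illustrative choices, not part of the original code:

dat <- data.frame(state = st, plan1 = p1, plan2 = p2, year = year)
str(dat)   # 100,000 rows, 4 columns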

I can now do 100,000 samples almost instantaneously. I knew that vectors were faster, but dang. I presume 100,000 iterations would have taken well over an hour with the loop, while the vectorized approach takes under a second. Just for kicks I made the vectors a million elements long; that took about 2 seconds to complete. Since I must test to failure, I tried 10 million but ran out of memory on my 2 GB laptop. I switched over to my Vista 64 desktop with 6 GB of RAM and created vectors of length 10 million in 17 seconds. At 100 million things fell apart: one of the vectors was over 763 MB (100 million doubles at 8 bytes each), which resulted in an allocation error in R.

Vectors in R are amazingly fast to me. I guess that's why I am an economist and not a computer scientist.

JD Long
These look cool to me, never having seen the R language before.
Joe Philllips
JD: Investigate do.call, sapply, lapply, and tapply. These were turning points in R for me. Anonymous functions are useful too.
Vince
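
For anyone curious about the apply-family functions Vince mentions, here is a minimal illustrative sketch using sapply for the same sampling task (the fully vectorized sample() calls above remain the simplest and fastest route):

# sapply runs the anonymous function once per column index and
# simplifies the results into a 100,000 x 3 matrix
cols <- sapply(1:3, function(i)
    sample(c(12,17,24), 100000, prob=c(20,30,50), replace=TRUE))
st <- cols[, 1]; p1 <- cols[, 2]; p2 <- cols[, 3]
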
+2  A: 

To answer your question about why the loop of 10000 took much longer than your loop of 1000:

I think the primary suspect is the concatenation happening on every iteration. Each time you grow a vector with c(), R copies every existing element into a new vector that is one element longer, so the total copying work grows roughly with the square of the final length. With 1,000 outer iterations you make 10,000 appends to vectors that average about 5,000 elements; with 10,000 outer iterations you make 100,000 appends to vectors that average about 50,000 elements, which is roughly 100 times as much copying.

David Locke
That is exactly it. Thank you for pointing that out.
JD Long
Today I figured out a faster way to add elements to a vector: append. So the year vector would now look like: years <- append(years, year, after=length(years))
JD Long
That's unlikely to be much faster - you need to preallocate.
hadley
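
To illustrate hadley's point about preallocation, here is a rough sketch contrasting the two growth strategies (timings are machine-dependent and purely illustrative):

n <- 100000

# growing with c() copies the whole vector on every append (~n^2/2 copies in total)
grow <- function(n) {
    x <- NULL
    for (i in 1:n) x <- c(x, i)
    x
}

# preallocating the full length up front and filling by index (~n assignments in total)
fill <- function(n) {
    x <- numeric(n)
    for (i in 1:n) x[i] <- i
    x
}

system.time(grow(n))
system.time(fill(n))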