Suppose I want to perform a simulation using the following function:

fn1 <- function(N) {
  res <- c()
  for (i in 1:N) {
    x <- rnorm(2)
    res <- c(res,x[2]-x[1])
  }
  res
}

For very large N, the computation appears to hang. Are there better ways of doing this?

(Inspired by: https://stat.ethz.ch/pipermail/r-help/2008-February/155591.html)

+2  A: 

For loops in R are notoriously slow, but there is another issue here: it is much faster to preallocate the results vector, res, than to append to it at each iteration.

Below we compare the speed of the version above with a version that starts with a vector res of length N and fills in the ith element during the loop.

fn1 <- function(N) {
  res <- c()
  for (i in 1:N) {
     x <- rnorm(2)
     res <- c(res,x[2]-x[1])
  }
  res
}
fn2 <- function(N) {
  res <- rep(0,N)
  for (i in 1:N) {
     x <- rnorm(2)
     res[i] <- x[2]-x[1]
  }
  res
}
> N <- 50000
> system.time(res1 <- fn1(N))
   user  system elapsed 
  6.568   0.256   6.826 
> system.time(res2 <- fn2(N))
   user  system elapsed 
  0.452   0.004   0.496

Also, as Sharpie points out, we can make this slightly faster by using R functions like apply (or its relatives, sapply and lapply).

fn3 <- function(N) {
  sapply( 1:N, function( i ){ x <- rnorm(2); return( x[2] - x[1] ) } )
}
> system.time(res3 <- fn3(N))
   user  system elapsed 
  0.397   0.004   0.397
Christopher DuBois
What's wrong with the second answer in that R list thread: res <- rnorm(10^6)-rnorm(10^6)?
ars
@ars: You are absolutely right; that gives the fastest solution (by an order of magnitude). The best advice would be: 1. Use functions that naturally work on vectors (like rnorm does); 2. Failing that, use an *apply function; 3. Failing that, use a for loop with preallocation.
Richie Cotton
+4  A: 

The efficiency of loops can be increased tremendously in R through the use of the apply functions, which essentially process whole vectors of data at once rather than looping through them. For the loop shown above, there are two basic operations happening during each iteration:

# A vector of two random numbers is generated
x <- rnorm( 2 )

# The difference between those numbers is calculated
x[2] - x[1]

In this case the appropriate function is sapply(). sapply() operates on a list of objects, such as the vector 1:N used by the loop above, and returns a vector of results:

sapply( 1:N, function( i ){ x <- rnorm(2); return( x[2] - x[1] ) } )

Note that the index value i is available during the function call and successively takes on the values between 1 and N; however, it is not needed in this case.
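Since the index is not actually used, a closely related base-R alternative (a sketch added here, not part of the original answer) is replicate(), which evaluates an expression a given number of times and ignores any index entirely:

fn3b <- function(N) {
  # replicate() evaluates the expression N times and simplifies to a vector
  replicate(N, { x <- rnorm(2); x[2] - x[1] })
}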

Getting into the habit of recognizing where apply can be used instead of for is a very valuable skill: many R libraries for parallel computation provide plug-and-play parallelization through apply functions, so using apply often gives access to significant performance increases on multicore systems with zero refactoring of code.
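As a rough sketch of that idea (assuming the parallel package that ships with recent versions of R; the function and argument names below are illustrative, not from the original answer), the sapply() call above can be swapped for a parallel equivalent with almost no changes:

library(parallel)  # bundled with R >= 2.14.0

fn3_par <- function(N, cores = 2) {
  # mclapply() forks worker processes on Unix-alikes; mc.cores sets how many.
  # The per-iteration work is identical to the sapply() version above.
  unlist(mclapply(1:N, function(i) {
    x <- rnorm(2)
    x[2] - x[1]
  }, mc.cores = cores))
}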

Sharpie
+1  A: 

Sometimes a loop is not needed. Since rnorm gives an iid sample (theoretically), you will achieve the same result (sampling X - Y, where X and Y are N(0,1)) by doing:

res <- rnorm(N)-rnorm(N)
mpiktas
+1  A: 

Expanding on my comment to chris_dubois's answer, here's some timing information:

> system.time(res <- rnorm(50000) - rnorm(50000))
user  system elapsed
0.06    0.00    0.06

Contrast this with fn3 from that same answer:

> system.time(res3 <- fn3(50000))
user  system elapsed
1.33    0.01    1.36

The first thing to notice is that my laptop is slower than chris_dubois's machine. :)

The second, and more important, point is that the vector approach, quite applicable here, is an order of magnitude faster. (Also pointed out by Richie Cotton in a comment to the same answer).

This brings me to the final point: it is a myth that apply and its friends are much faster than for loops in R. They are on the same order in most measurements I've seen, because they are just for loops behind the scenes. See also this post:

http://yusung.blogspot.com/2008/04/speed-issue-in-r-computing-apply-vs.html

According to Professor Brian Ripley, "apply() is just a wrapper for a loop." The only advantage of using apply() is that it makes your code neater!

Exactly. You should use apply if it's more expressive, especially if you're programming in a functional style. Not because it's faster.
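If you want to check this on your own machine, here is a minimal sketch (timings will of course differ by system) comparing a preallocated for loop against the equivalent sapply() call:

N <- 50000

loop_version <- function(N) {
  res <- numeric(N)            # preallocate the result vector
  for (i in 1:N) {
    x <- rnorm(2)
    res[i] <- x[2] - x[1]
  }
  res
}

sapply_version <- function(N) {
  sapply(1:N, function(i) { x <- rnorm(2); x[2] - x[1] })
}

system.time(loop_version(N))   # typically the same order of magnitude...
system.time(sapply_version(N)) # ...as this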

ars
Christopher DuBois
Hey, that's a good idea -- I'm woefully ignorant of that wiki. I know I've come across a couple of loop-to-vector optimizations by reading code written by others -- most recently looking at some code by Heagerty for constructing variograms. I tend to assume it's common knowledge and not noteworthy for others, but it's better to err on the side of documenting it. I'll go through my files and find something specific to add to the wiki, hopefully by this weekend. Do you have any thoughts on how to structure it? Should we just create a "vectorizing tips" page and break it out as necessary?
ars
A: 

Maybe the most efficient replacement for your function would simply be:

fn <- function(N) rnorm(N, 0, sqrt(2))

which is twice as fast as taking the difference of iid normal variates. More generally, if your goal is to run simple simulations, vector/array preallocation and calls to native functions speed up the process greatly.
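As a quick sanity check (a sketch added here, not part of the original answer), both approaches should produce draws with a standard deviation near sqrt(2), since Var(X - Y) = 2 for independent standard normals:

set.seed(1)
a <- rnorm(1e6) - rnorm(1e6)   # difference of iid N(0,1) draws
b <- rnorm(1e6, 0, sqrt(2))    # direct N(0, 2) draws
sd(a)  # should be close to 1.414
sd(b)  # should be close to 1.414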

If you want to perform Monte Carlo simulations for statistical estimation (e.g., MCMC), R has a number of native packages. For general stochastic simulation I am not aware of R packages, but you may want to try Simpy (http://simpy.sourceforge.net/), which is excellent.

gappy