Good morning,

I have been developing in R for a few months, and I have to make sure that my code's execution time is not too long, because I analyze big datasets.

Hence, I have been trying to use as many vectorized functions as possible.

However, I am still wondering something.

What is costly in R is not the loop itself, right? I mean, the problem arises when you start modifying variables within the loop, for example; is that correct?

Hence I was wondering: what if you simply have to run a function on each element, and you do not actually care about the result? For example, to write data to a database. What should you do?

1) use mapply without storing the result anywhere?

2) loop over the vector and simply apply f(i) to each element?

3) is there a better function I might have missed?

(that's of course assuming your function is not optimally vectorized).
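For the side-effect-only case, one common pattern (a sketch, not from the original thread) is a plain for loop, or `lapply` wrapped in `invisible()` so the list of return values is discarded at the top level. Here `write_row` is a hypothetical stand-in for a real database write:

```r
## Sketch: applying a function purely for its side effect.
## `write_row` is a hypothetical stand-in for a database write;
## it just records a log line in the enclosing environment.
log_lines <- character(0)
write_row <- function(i) {
  log_lines[[length(log_lines) + 1]] <<- sprintf("row %d written", i)
}

for (i in 1:3) write_row(i)          # option 2: a plain loop
invisible(lapply(4:5, write_row))    # option 1-style: run and discard the results

length(log_lines)  # 5
```

Both forms cost about the same; the loop body (the database call) dominates, so there is little to gain from choosing one over the other.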

What about the foreach package? Have you experienced any performance improvement by using it?
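For reference, a minimal `foreach` sketch (assuming the foreach package is installed): `%do%` evaluates sequentially and adds per-iteration overhead, so any speed-up only comes from `%dopar%` with a registered parallel backend such as doParallel.

```r
## Minimal foreach sketch; %do% runs sequentially, so this is
## about expressing the iteration, not about speed.
library(foreach)
res <- foreach(i = 1:5, .combine = c) %do% sin(i)
isTRUE(all.equal(res, sin(1:5)))  # TRUE
```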

+5  A: 

Just a couple of comments. A for loop is roughly as fast as apply and its variants; the real speed-ups come when you vectorise your function as much as possible (that is, call functions whose loops run in compiled code, rather than apply, which just hides an R-level for loop). I'm not sure if this is the best example, but consider the following:

> n <- 1e06
> sinI <- rep(NA,n)
> system.time(for(i in 1:n) sinI[i] <- sin(i))
   user  system elapsed 
  3.316   0.000   3.358 
> system.time(sinI <- sapply(1:n,sin))
   user  system elapsed 
  5.217   0.016   5.311 
> system.time(sinI <- unlist(lapply(1:n,sin),
+       recursive = FALSE, use.names = FALSE))
   user  system elapsed 
  1.284   0.012   1.303 
> system.time(sinI <- sin(1:n))
   user  system elapsed 
  0.056   0.000   0.057 

In one of the comments below, Marek points out that the time-consuming part of the for loop above is actually the `[<-` part:

> system.time(for(i in 1:n) sin(i))

The bottlenecks which can't immediately be vectorised can be rewritten in C or Fortran, compiled with R CMD SHLIB, and then plugged in with .Call, .C or .Fortran.

Also, see these links for more info about loop optimisation in R, and check out the article "How Can I Avoid This Loop or Make It Faster?" in R News.

nullglob
Isn't the apply function still handling the loop better, thanks to its C implementation? The question is in fact general: in your opinion, is it better to use Reduce than to implement a simple loop, for example?
JSmaga
In the `sapply` version most of the time is spent on post-processing the results. If you run `system.time(sinI <- unlist(lapply(1:n,sin),FALSE,FALSE))` you should get the fastest version (apart from `sin(1:n)`, of course). In the `for` loop the time-consuming part is `[<-`; check `system.time(for(i in 1:n) sin(i))` (useless in this case, since it drops the results).
Marek