ansaurus

Question

Is R's apply family more than syntactic sugar?

Answer 1

+17 A:

Shane 2010-02-16 20:15:10

Most multicore packages for R also implement parallelization through the `apply` family of functions. Structuring programs around `apply` therefore allows them to be parallelized at a very small marginal cost.

Sharpie 2010-02-16 21:38:50

Thank you both!

steffen 2010-02-17 08:40:33

Sharpie - thank you for that! Any idea for an example showing that (on Windows XP)?

Tal Galili 2010-02-17 11:39:29

I would suggest looking at the `snowfall` package and trying the examples in its vignette. `snowfall` builds on top of the `snow` package and abstracts the details of parallelization even further, making it dead simple to execute parallelized `apply` functions.

Sharpie 2010-02-19 03:31:10
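As a rough illustration of the point made in the comments above, here is a minimal sketch of turning an `lapply` call into a parallel one. It uses the `parallel` package bundled with current versions of R rather than `snow` or `snowfall`; that substitution is an assumption made so the example needs no extra installation, and `slow_square` is a hypothetical stand-in for real work.

```r
library(parallel)  # bundled with current R; not the package named in the thread

# Hypothetical stand-in for an expensive per-element computation
slow_square <- function(i) {
  Sys.sleep(0.01)
  i^2
}

inputs <- 1:20

# Serial version
serial <- lapply(inputs, slow_square)

# Parallel version: the same call shape, plus a cluster object
cl <- makeCluster(2)  # two local worker processes (socket cluster)
par_res <- parLapply(cl, inputs, slow_square)
stopCluster(cl)

# Both versions should return the same list
same_result <- identical(serial, par_res)
```

The only structural change is creating and stopping the cluster; the function and the data are passed exactly as they were to `lapply`. Whether this pays off depends on how expensive each call is relative to the cost of shipping data to the workers.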

Answer 2

+5 A:

Sometimes the speedup can be substantial, for example when you have to nest for-loops to get the average based on a grouping of more than one factor. Here are two approaches that give you the exact same result:

set.seed(1)  # for reproducibility of the results

# The data
X <- rnorm(100000)
Y <- as.factor(sample(letters[1:5], 100000, replace = TRUE))
Z <- as.factor(sample(letters[1:10], 100000, replace = TRUE))

# the function forloop that averages X over every combination of Y and Z
forloop <- function(x,y,z){
  # Computed once up front, so that levels() and length()
  # don't have to be called more than once.
  ylev <- levels(y)
  zlev <- levels(z)
  n <- length(ylev)
  p <- length(zlev)

  out <- matrix(NA,ncol=p,nrow=n)
  for(i in 1:n){
      for(j in 1:p){
          out[i,j] <- mean(x[y==ylev[i] & z==zlev[j]])
      }
  }
  rownames(out) <- ylev
  colnames(out) <- zlev
  return(out)
}

# Used on the generated data
forloop(X,Y,Z)

# The same using tapply
tapply(X,list(Y,Z),mean)

Both give exactly the same result: a 5 x 10 matrix of averages with named rows and columns. But:

> system.time(forloop(X,Y,Z))
   user  system elapsed 
   0.94    0.02    0.95 

> system.time(tapply(X,list(Y,Z),mean))
   user  system elapsed 
   0.06    0.00    0.06

There you go. What did I win? ;-)

Joris Meys 2010-08-27 12:51:40

You won an up-vote from me. :)

Shane 2010-08-27 13:29:03

aah, so sweet :-) I was actually wondering if anybody would ever come across my rather late answer.

Joris Meys 2010-08-27 15:20:06

I always sort by "active". :) Not sure how to generalize your answer; sometimes `*apply` is faster. But I think that the more important point is the *side effects* (updated my answer with an example).

Shane 2010-08-30 18:25:08

I think that apply is especially faster when you want to apply a function over different subsets. If there is a smart apply solution for a nested loop, I guess the apply solution will be faster too. In most cases apply doesn't gain much speed I guess, but I definitely agree on the side effects.

Joris Meys 2010-08-30 21:57:25
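One way to read the "different subsets" remark above: the nested loop in Answer 2 rescans the full vectors for every combination of levels, whereas a grouping function can partition the data once. The sketch below is an illustration under that assumption, not a description of how `tapply` is implemented. It splits the data once with `split()` and then uses a plain for loop over the chunks. The data are a smaller, lowercase variant of the answer's example, and the name `split_loop` is made up here.

```r
set.seed(1)
x <- rnorm(5000)
y <- factor(sample(letters[1:5], 5000, replace = TRUE))
z <- factor(sample(letters[1:10], 5000, replace = TRUE))

# Split x once into one chunk per (y, z) combination.
# In the resulting list, the levels of the first factor vary fastest.
chunks <- split(x, list(y, z))

# Plain for loop over the pre-split chunks
means <- numeric(length(chunks))
for (k in seq_along(chunks)) {
  means[k] <- mean(chunks[[k]])
}

# Column-major fill matches "first factor varies fastest"
split_loop <- matrix(means, nrow = nlevels(y),
                     dimnames = list(levels(y), levels(z)))

# Compare against the tapply result
via_tapply <- tapply(x, list(y, z), mean)
same <- isTRUE(all.equal(split_loop, via_tapply))
```

If the two agree, the loop itself is not the bottleneck in the earlier timing; the repeated subsetting inside it is. Timing both variants with `system.time()` on your own data is the way to check this.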
