I have two apply functions computing the mean and standard deviation across the first two dimensions of a large three-dimensional array (437216,8,3). It takes 16 minutes to complete on R x32. It's the first of many large arrays in a database that we run this script on regularly. Any thoughts on how to speed up the runtime?

A: 

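EDIT: Once the OP posted the code, the problem became clear. apply() calls the function once for each of the 437216 x 8 cells, whereas rowMeans() and rowSums() are vectorized. The trick is to convert the array to a data frame: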

> x = array(rnorm(437216*8*3), dim = c(437216,8,3))

> system.time(apply(x,1:2,mean))
   user  system elapsed 
 107.06    0.18  107.34 
 # This is run on a new quadcore i7, so it's not a slow machine...

> Tmp <- data.frame(V1=as.vector(x[,,1]),
+             V2=as.vector(x[,,2]),
+             V3= as.vector(x[,,3]))

> system.time({
+     Means <- rowMeans(Tmp)
+     Sd <- sqrt(rowSums((Tmp-Means)^2)/(3-1))
+ })
   user  system elapsed 
   6.72    0.40    7.12 

To get the results back into matrices of the original shape:

Means <- matrix(Means,ncol=8)
Sd <- matrix(Sd,ncol=8)

Proof of concept :

x = array(rnorm(10*8*3), dim = c(10,8,3))

m1 <- apply(x,1:2,mean)
sd1 <- apply(x,1:2,sd)

Tmp <- data.frame(V1=as.vector(x[,,1]),
            V2=as.vector(x[,,2]),
            V3= as.vector(x[,,3]))
m2 <- rowMeans(Tmp)

sd2 <- sqrt(rowSums((Tmp-m2)^2)/2)

m2 <-matrix(m2,ncol=8)
sd2 <- matrix(sd2,ncol=8)

> all.equal(m1,m2)
[1] TRUE

> all.equal(sd1,sd2)
[1] TRUE
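
A possible variation on the same idea, as a sketch only: rowMeans() and rowSums() accept a dims argument, so the array can be reduced over its third dimension directly, without building the data frame. The function name slice_stats and the small test array x_small are made up for illustration, and no timings are claimed here; it would need to be benchmarked on the real data.

```r
# Sketch: mean and sd over the third dimension of a 3-D array,
# using rowMeans()/rowSums() with dims = 2 instead of apply().
# slice_stats is a hypothetical helper name, not from the thread.
slice_stats <- function(x) {
  stopifnot(length(dim(x)) == 3, dim(x)[3] > 1)
  n <- dim(x)[3]
  # With dims = 2, the first two dimensions are treated as "rows"
  # and the mean is taken over the remaining (third) dimension.
  m <- rowMeans(x, dims = 2)
  # as.vector(m) has one value per (i, j) cell; it is recycled across
  # the third dimension because R stores arrays in column-major order.
  s <- sqrt(rowSums((x - as.vector(m))^2, dims = 2) / (n - 1))
  list(mean = m, sd = s)
}

# Small check against apply(), mirroring the proof of concept above.
set.seed(1)
x_small <- array(rnorm(10 * 8 * 3), dim = c(10, 8, 3))
res <- slice_stats(x_small)
all.equal(res$mean, apply(x_small, 1:2, mean))
all.equal(res$sd, apply(x_small, 1:2, sd))
```

The results come back as matrices with the first two dimensions of the input, so the matrix(..., ncol=8) step is not needed in this variant.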
Joris Meys
+1  A: 

That seems very slow. On my machine

set.seed(10)

x = array(rnorm(437216*8*3), dim = c(437216,8,3))

system.time(apply(x, 1, mean))

takes

   user  system elapsed 
 23.903   0.263  24.522 

FWIW,

system.time(apply(x, 2, mean))
       user  system elapsed 
      0.546   0.274   0.841 


system.time(apply(x, 3, mean))
   user  system elapsed 
  0.516   0.267   0.790 

What is your sessionInfo()?

sessionInfo()
R version 2.11.1 (2010-05-31) 
i386-apple-darwin9.8.0 

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] cimis_0.1-3    RLastFM_0.1-4  RCurl_1.4-2    bitops_1.0-4.1 XML_3.1-0      lattice_0.18-8

loaded via a namespace (and not attached):
[1] grid_2.11.1  tools_2.11.1
Greg
A: 

My sessionInfo() is as follows:

sessionInfo()
R version 2.11.0 (2010-04-22)
x86_64-pc-mingw32

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] abind_1.1-0   RSQLite_0.9-1 DBI_0.2-5

The apply function is applied across both the first and second margins (1:2), which I believe is what makes it run so long. I ran it on a better computer (the system listed above), which cut the run time somewhat (timings below), but it still seems to take longer than it should:

>  system.time(apply(x,1:2,mean))   
user  system elapsed
311.56    0.30  311.88
> system.time(apply(x,1:2,sd))    
user  system elapsed
505.92    0.21  506.81

I'll look into converting it to a data.frame and unlisting it as in the second suggestion. Thanks for all the help!

Try: 'Tmp <- data.frame(V1=as.vector(x[,,1]), V2=as.vector(x[,,2]), V3=as.vector(x[,,3]))'. That converts the array to a data frame, and then you can use the provided code.
Joris Meys
@curransk: I checked it, and it runs considerably faster than the original code. See the edited version of my previous answer.
Joris Meys
@Joris Meys -- Thank you so much, this sped things up big time. I'm down to less than a minute for some of my larger arrays. I'm new and have no "reputation" yet. Otherwise, I'd give you a big thumbs up. Thanks again!
@curransk: If you're the topic starter (Krissi?), you can accept a correct answer by clicking on the "V" sign to its left.
Joris Meys