I have two apply() calls computing the mean and standard deviation across the first two dimensions of a large three-dimensional array (437216, 8, 3). They take 16 minutes to complete on R x32. This is the first of many large arrays in a database that we will be running this script on regularly. Any thoughts on how to speed up the runtime?
A:
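EDIT: After the OP posted the code, the problem became clear. The trick is to convert the array to a data frame with one column per slice of the third dimension, then use the vectorized rowMeans() and rowSums():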
> x = array(rnorm(437216*8*3), dim = c(437216,8,3))
> system.time(apply(x,1:2,mean))
user system elapsed
107.06 0.18 107.34
# This is run on a new quadcore i7, so it's not a slow machine...
> Tmp <- data.frame(V1=as.vector(x[,,1]),
+ V2=as.vector(x[,,2]),
+ V3= as.vector(x[,,3]))
> system.time({
+ Means <- rowMeans(Tmp)
+ Sd <- sqrt(rowSums((Tmp-Means)^2)/(3-1))
+ })
user system elapsed
6.72 0.40 7.12
To get the results in the correct matrix :
Means <- matrix(Means,ncol=8)
Sd <- matrix(Sd,ncol=8)
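The same steps can be wrapped in a small function that works for any size of third dimension. This is only a sketch: the name slice_stats is hypothetical, and it flattens to a plain matrix rather than a data frame, which is a substitution on my part and not what the answer above does. It assumes there are no NA values.

```r
# Hypothetical helper: cell-wise mean and sd over the third dimension.
slice_stats <- function(x) {
  d <- dim(x)
  stopifnot(length(d) == 3, d[3] > 1)
  # R stores arrays column-major, so column k of `flat`
  # is the same as as.vector(x[, , k]).
  flat <- matrix(as.vector(x), ncol = d[3])
  means <- rowMeans(flat)
  # Sample standard deviation (n - 1 denominator), matching sd().
  sds <- sqrt(rowSums((flat - means)^2) / (d[3] - 1))
  # Reshape back to the layout of the first two dimensions.
  list(mean = matrix(means, nrow = d[1]),
       sd   = matrix(sds,   nrow = d[1]))
}

set.seed(1)
x <- array(rnorm(10 * 8 * 3), dim = c(10, 8, 3))
res <- slice_stats(x)
dim(res$mean)
```

Both elements of the result have the dimensions of the first two margins of x.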
Proof of concept :
x = array(rnorm(10*8*3), dim = c(10,8,3))
m1 <- apply(x,1:2,mean)
sd1 <- apply(x,1:2,sd)
Tmp <- data.frame(V1=as.vector(x[,,1]),
V2=as.vector(x[,,2]),
V3= as.vector(x[,,3]))
m2 <- rowMeans(Tmp)
sd2 <- sqrt(rowSums((Tmp-m2)^2)/2)
m2 <-matrix(m2,ncol=8)
sd2 <- matrix(sd2,ncol=8)
> all.equal(m1,m2)
[1] TRUE
> all.equal(sd1,sd2)
[1] TRUE
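A possible alternative, not part of the original answer, skips the intermediate data frame entirely: base R's rowMeans() and rowSums() accept a dims argument, and with dims = 2 they reduce over everything after the first two dimensions. A minimal sketch, assuming no NA values and at least two slices in the third dimension:

```r
set.seed(1)
x <- array(rnorm(10 * 8 * 3), dim = c(10, 8, 3))
n <- dim(x)[3]

# Mean over the third dimension for every (i, j) cell.
m3 <- rowMeans(x, dims = 2)

# as.vector(m3) has length dim1 * dim2, so it is recycled once per slice
# of the third dimension, subtracting each cell's own mean.
sd3 <- sqrt(rowSums((x - as.vector(m3))^2, dims = 2) / (n - 1))

all.equal(m3, apply(x, 1:2, mean))
all.equal(sd3, apply(x, 1:2, sd))
```

The results come back as matrices already, so no reshaping step is needed.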
Joris Meys
2010-09-10 16:11:35
A:
That seems very slow. On my machine
set.seed(10)
x = array(rnorm(437216*8*3), dim = c(437216,8,3))
system.time(apply(x, 1, mean))
takes
user system elapsed
23.903 0.263 24.522
FWIW,
system.time(apply(x, 2, mean))
user system elapsed
0.546 0.274 0.841
system.time(apply(x, 3, mean))
user system elapsed
0.516 0.267 0.790
What is your sessionInfo()?
sessionInfo()
R version 2.11.1 (2010-05-31)
i386-apple-darwin9.8.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] cimis_0.1-3 RLastFM_0.1-4 RCurl_1.4-2 bitops_1.0-4.1 XML_3.1-0 lattice_0.18-8
loaded via a namespace (and not attached):
[1] grid_2.11.1 tools_2.11.1
Greg
2010-09-10 18:01:41
A:
My sessionInfo() is as follows:
R version 2.11.0 (2010-04-22)
x86_64-pc-mingw32
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] abind_1.1-0 RSQLite_0.9-1 DBI_0.2-5
The apply function is applied across both the first and second margins (1:2), which I believe is what is causing it to run so long. I ran it on a better computer (session info listed above), which cut the run time somewhat (timings below), but it still seems to be taking longer than it should:
> system.time(apply(x,1:2,mean))
user system elapsed
311.56 0.30 311.88
> system.time(apply(x,1:2,sd))
user system elapsed
505.92 0.21 506.81
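To illustrate why the 1:2 margin is expensive: apply() calls the supplied function once per cell of the chosen margins, so here that is 437216 * 8 separate calls to mean() or sd(), each on a vector of length 3. A small sketch on a reduced array; the counter and the wrapper name counted_mean are hypothetical, added only for the demonstration:

```r
x <- array(rnorm(1000 * 8 * 3), dim = c(1000, 8, 3))

calls <- 0L
# Hypothetical wrapper that counts how often apply() invokes it.
counted_mean <- function(v) {
  calls <<- calls + 1L
  mean(v)
}

invisible(apply(x, 1:2, counted_mean))

# One call per (row, column) cell.
calls == prod(dim(x)[1:2])
```

The per-call overhead, rather than the arithmetic itself, is what the vectorized rowMeans()/rowSums() approach avoids.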
I'll look into converting it to a data.frame and unlisting it as in the second suggestion. Thanks for all the help!
Try: 'Tmp <- data.frame(V1=as.vector(x[,,1]), V2=as.vector(x[,,2]), V3=as.vector(x[,,3]))'. That converts the array to a data frame, and then you can use the provided code.
Joris Meys
2010-09-13 16:40:03
@curransk : I checked it and it works considerably faster than the original code. See the edited version of my previous answer.
Joris Meys
2010-09-13 16:55:13
@Joris Meys -- Thank you so much, this sped things up big time. I'm down to less than a minute for some of my larger arrays. I'm new and have no "reputation" yet. Otherwise, I'd give you a big thumbs up. Thanks again!
2010-09-13 20:00:43
@curransk : if you're the one who asked the question (Krissi?), you can accept a correct answer by clicking on the "V" sign on the left.
Joris Meys
2010-09-14 08:35:51