I have a data.frame (link to file) with 18 columns and 11520 rows that I transform like this:

library(plyr)
df.median<-ddply(data, .(groupname,starttime,fPhase,fCycle), 
                 numcolwise(median), na.rm=TRUE)

according to system.time(), it takes about this long to run:

   user  system elapsed 
   5.16    0.00    5.17

This call is part of a webapp, so run time is pretty important. Is there a way to speed this call up?

+5  A: 

Just to summarize some of the points from the comments:

  1. Before you start to optimize, you should have some sense of "acceptable" performance. Depending upon the required performance, you can then be more specific about how to improve the code. For instance, at some threshold, you would need to stop using R and move on to a compiled language.
  2. Once you have an expected run-time, you can profile your existing code to find potential bottlenecks. R has several mechanisms for this, including Rprof (there are examples on stackoverflow if you search for [r] + rprof); see the sketch after this list.
  3. plyr is designed primarily for ease-of-use, not for performance (although recent versions have had some nice performance improvements). Some of the base functions are faster because they have less overhead. @JDLong pointed to a nice thread that covers some of these issues, including some specialized techniques from Hadley.
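
As an illustration of point 2, a minimal sketch profiling the question's own ddply call with Rprof (the file name "ddply.prof" is arbitrary):

library(plyr)
Rprof("ddply.prof")                # start collecting profiling samples
df.median <- ddply(data, .(groupname, starttime, fPhase, fCycle),
                   numcolwise(median), na.rm = TRUE)
Rprof(NULL)                        # stop profiling
summaryRprof("ddply.prof")         # report time spent in each function

The by.self table in summaryRprof's output shows which functions consume the most time and is usually the first place to look for a bottleneck.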
Shane
Thanks for the summary. And thanks to everyone who contributed such useful information. I have a lot of reading to do!
dnagirl
+6  A: 

Just using aggregate is quite a bit faster...

> # grouping columns, and the numeric columns to summarize
> groupVars <- c("groupname","starttime","fPhase","fCycle")
> dataVars <- colnames(data)[ !(colnames(data) %in% c("location",groupVars)) ]
> 
> system.time(ag.median <- aggregate(data[,dataVars], data[,groupVars], median))
   user  system elapsed 
   1.89    0.00    1.89 
> system.time(df.median <- ddply(data, .(groupname,starttime,fPhase,fCycle), numcolwise(median), na.rm=TRUE))
   user  system elapsed 
   5.06    0.00    5.06 
> 
> # reorder aggregate's result to match ddply's row and column order
> ag.median <- ag.median[ do.call(order, ag.median[,groupVars]), colnames(df.median)]
> rownames(ag.median) <- 1:NROW(ag.median)
> 
> identical(ag.median, df.median)
[1] TRUE
Joshua Ulrich
`aggregate` fixes this problem handily.
dnagirl
+1  A: 

Well, I just did a few simple transformations on a large data frame (the baseball dataset in the plyr package) using the standard library functions (e.g., table, tapply, aggregate, etc.) and the analogous plyr functions; in each instance, I found plyr to be significantly slower. E.g.,

> system.time(table(BB$year))
    user  system elapsed 
   0.007   0.002   0.009 

> system.time(ddply(BB, .(year), 'nrow'))
    user  system elapsed 
   0.183   0.005   0.189 

Second, I did not investigate whether this would improve performance in your case, but for data frames of the size you are working with now and larger, I use the data.table package, available on CRAN. It is simple to create data.table objects and to convert existing data.frames to data.tables: just call data.table on the data.frame you want to convert:

dt1 = data.table(my_dataframe)
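
For the question's specific task, a hedged sketch of the same grouped median in data.table, reusing the dataVars vector from Joshua's answer (the grouping columns come from the question):

library(data.table)
dt <- data.table(data)                  # convert the question's data.frame
# median of each measurement column within each group;
# .SDcols restricts .SD to the numeric columns of interest
dt.median <- dt[, lapply(.SD, median, na.rm = TRUE),
                by = list(groupname, starttime, fPhase, fCycle),
                .SDcols = dataVars]

data.table does its grouping in optimized C code, so on data of this size it will typically beat both ddply and aggregate.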
doug
+2  A: 

To add to Joshua's solution: if you decide to use mean instead of median, you can speed up the computation another 4 times:

> system.time(ag.median <- aggregate(data[,dataVars], data[,groupVars], median))
   user  system elapsed 
   3.472   0.020   3.615 
> system.time(ag.mean <- aggregate(data[,dataVars], data[,groupVars], mean))
   user  system elapsed 
   0.936   0.008   1.006 
VitoshKa
Very interesting! I'll keep that in mind. Unfortunately, this analysis has to compare medians.
dnagirl
+2  A: 

The order of the data matters when you are calculating medians: if the data are in order from smallest to largest, then the calculation is a bit quicker.

x <- 1:1e6         # already sorted
y <- sample(x)     # same values, random order
system.time(for(i in 1:1e2) median(x))
   user  system elapsed 
   3.47    0.33    3.80

system.time(for(i in 1:1e2) median(y))
   user  system elapsed 
   5.03    0.26    5.29

For new datasets, sort the data by an appropriate column when you import them. For existing datasets, you can sort them as a batch job (outside the web app).
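
A minimal sketch of that batch job, assuming a hypothetical numeric column named value that dominates the median calculations (the file name is also arbitrary):

# one-off job, run outside the web app
data <- data[order(data$value), ]       # value is a hypothetical column
save(data, file = "data_sorted.RData")

# inside the web app, load the pre-sorted copy
load("data_sorted.RData")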

Richie Cotton