I have a data.frame (link to file) with 18 columns and 11520 rows that I transform like this:

library(plyr)
df.median<-ddply(data, .(groupname,starttime,fPhase,fCycle), 
                 numcolwise(median), na.rm=TRUE)

according to system.time(), it takes about this long to run:

   user  system elapsed 
   5.16    0.00    5.17

This call is part of a webapp, so run time is pretty important. Is there a way to speed this call up?

+5  A: 

Just to summarize some of the points from the comments:

  1. Before you start to optimize, you should have some sense of "acceptable" performance. Depending upon the required performance, you can then be more specific about how to improve the code. For instance, at some threshold, you would need to stop using R and move on to a compiled language.
  2. Once you have an expected run-time, you can profile your existing code to find potential bottlenecks. R has several mechanisms for this, including Rprof (there are examples on stackoverflow if you search for [r] + rprof); see the sketch after this list.
  3. plyr is designed primarily for ease-of-use, not for performance (although recent versions have had some nice performance improvements). Some of the base functions are faster because they have less overhead. @JDLong pointed to a nice thread that covers some of these issues, including some specialized techniques from Hadley.
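
As an illustration of point 2, a minimal sketch profiling the question's own ddply call with Rprof (the file name "ddply.prof" is arbitrary):

library(plyr)
Rprof("ddply.prof")                # start collecting profiling samples
df.median <- ddply(data, .(groupname, starttime, fPhase, fCycle),
                   numcolwise(median), na.rm = TRUE)
Rprof(NULL)                        # stop profiling
summaryRprof("ddply.prof")         # report time spent in each function

The by.self table in summaryRprof's output shows which functions consume the most time and is usually the first place to look for a bottleneck.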
Shane
Thanks for the summary. And thanks to everyone who contributed such useful information. I have a lot of reading to do!
dnagirl
+6  A: 

Just using aggregate is quite a bit faster...

> # grouping columns, and the numeric columns to summarize
> groupVars <- c("groupname","starttime","fPhase","fCycle")
> dataVars <- colnames(data)[ !(colnames(data) %in% c("location",groupVars)) ]
> 
> system.time(ag.median <- aggregate(data[,dataVars], data[,groupVars], median))
   user  system elapsed 
   1.89    0.00    1.89 
> system.time(df.median <- ddply(data, .(groupname,starttime,fPhase,fCycle), numcolwise(median), na.rm=TRUE))
   user  system elapsed 
   5.06    0.00    5.06 
> 
> # reorder aggregate's result to match ddply's row and column order
> ag.median <- ag.median[ do.call(order, ag.median[,groupVars]), colnames(df.median)]
> rownames(ag.median) <- 1:NROW(ag.median)
> 
> identical(ag.median, df.median)
[1] TRUE
Joshua Ulrich
`aggregate` fixes this problem handily.
dnagirl
+1  A: 

Well, I just did a few simple transformations on a large data frame (the baseball dataset in the plyr package) using the standard library functions (e.g., table, tapply, aggregate, etc.) and the analogous plyr functions; in each instance, I found plyr to be significantly slower. E.g.,

> system.time(table(BB$year))
    user  system elapsed 
   0.007   0.002   0.009 

> system.time(ddply(BB, .(year), 'nrow'))
    user  system elapsed 
   0.183   0.005   0.189 

Second, I did not investigate whether this would improve performance in your case, but for data frames of the size you are working with now and larger, I use the data.table package, available on CRAN. It is simple to create data.table objects and to convert existing data.frames to data.tables: just call data.table on the data.frame you want to convert:

dt1 = data.table(my_dataframe)
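
For the question's specific task, a hedged sketch of the same grouped median in data.table, reusing the dataVars vector from Joshua's answer (the grouping columns come from the question):

library(data.table)
dt <- data.table(data)                  # convert the question's data.frame
# median of each measurement column within each group;
# .SDcols restricts .SD to the numeric columns of interest
dt.median <- dt[, lapply(.SD, median, na.rm = TRUE),
                by = list(groupname, starttime, fPhase, fCycle),
                .SDcols = dataVars]

data.table does its grouping in optimized C code, so on data of this size it will typically beat both ddply and aggregate.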
doug
+2  A: 

To add to Joshua's solution: if you decide to use mean instead of median, you can speed up the computation another 4 times:

> system.time(ag.median <- aggregate(data[,dataVars], data[,groupVars], median))
   user  system elapsed 
   3.472   0.020   3.615 
> system.time(ag.mean <- aggregate(data[,dataVars], data[,groupVars], mean))
   user  system elapsed 
   0.936   0.008   1.006 
VitoshKa
Very interesting! I'll keep that in mind. Unfortunately, this analysis has to compare medians.
dnagirl
+2  A: 

The order of the data matters when you are calculating medians: if the data are in order from smallest to largest, then the calculation is a bit quicker.

x <- 1:1e6         # already sorted
y <- sample(x)     # same values, random order
system.time(for(i in 1:1e2) median(x))
   user  system elapsed 
   3.47    0.33    3.80

system.time(for(i in 1:1e2) median(y))
   user  system elapsed 
   5.03    0.26    5.29

For new datasets, sort the data by an appropriate column when you import them. For existing datasets, you can sort them as a batch job (outside the web app).
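
A minimal sketch of that batch job, assuming a hypothetical numeric column named value that dominates the median calculations (the file name is also arbitrary):

# one-off job, run outside the web app
data <- data[order(data$value), ]       # value is a hypothetical column
save(data, file = "data_sorted.RData")

# inside the web app, load the pre-sorted copy
load("data_sorted.RData")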

Richie Cotton