views:

82

answers:

1

I'm working with a large data frame called exp (file here) in R. In the interests of performance, it was suggested that I check out the idata.frame() function from plyr. But I think I'm using it wrong.

My original call, slow but it works:

df.median<-ddply(exp, 
                 .(groupname,starttime,fPhase,fCycle), 
                 numcolwise(median), 
                 na.rm=TRUE)

With idata.frame, Error: is.data.frame(df) is not TRUE

library(plyr)
df.median<-ddply(idata.frame(exp), 
                 .(groupname,starttime,fPhase,fCycle), 
                 numcolwise(median), 
                 na.rm=TRUE)

So, I thought, perhaps it is my data. So I tried the baseball dataset. The idata.frame example works fine: dlply(idata.frame(baseball), "id", nrow) But if I try something similar to my desired call using baseball, it doesn't work:

bb.median<-ddply(idata.frame(baseball), 
                 .(id,year,team), 
                 numcolwise(median), 
                 na.rm=TRUE)
>Error: is.data.frame(df) is not TRUE

Perhaps my error is in how I'm specifying the groupings? Anyone know how to make my example work?

ETA:

I also tried:

groupVars <- c("groupname","starttime","fPhase","fCycle")
voi<-c('inadist','smldist','lardist')

i<-idata.frame(exp)
ag.median <- aggregate(i[,voi], i[,groupVars], median)
Error in i[, voi] : object of type 'environment' is not subsettable

which uses a faster way of getting the medians, but gives a different error. I don't think I understand how to use idata.frame at all.

A: 

Strange behaviour, but even in the docs it says that idata.frame is experimental. You probably found a bug. Perhaps you could rewrite the check at the top of ddply that tests is.data.frame().

In any case, this cuts about 20% off the time (on my system):

system.time(df.median<-ddply(exp, .(groupname,starttime,fPhase,fCycle), function(x) data.frame(
inadist=median(x$inadist),
smldist=median(x$smldist),
lardist=median(x$lardist),
inadur=median(x$inadur),
smldur=median(x$smldur),
lardur=median(x$lardur),
emptyct=median(x$emptyct),
entct=median(x$entct),
inact=median(x$inact),
smlct=median(x$smlct),
larct=median(x$larct),
na.rm=TRUE))
) 

Shane asked you in another post if you could cache the results of your script. I don't really have an idea of your workflow, but it may be best to setup a chron to run this and store the results, daily/hourly whatever.

Brandon Bertelsen
that's interesting that specifying the columns speeds things up (it does so on my system too), but so far the fastest solution is to use `aggregate()`. In any event, the reason I used my old slow call in this question was because it caused `idata.frame()` to choke. I have a lot of calls that use the exp data frame, and I thought if I could substitute an idata.frame, it might speed up all the calls significantly.
dnagirl
You can use `idata.frame` with this call - but it only gives about a 10% speed up because you're not making that many splits.
hadley
And yes, `aggregate` will currently beat `ddply` + `colwise`. Thinking about how to do better for the next version.
hadley