A recurring analysis paradigm I encounter in my research is the need to subset the data by each distinct group ID value, perform a statistical analysis on each group in turn, and put the results in an output matrix for further processing/summarizing.

How I typically do this in R is something like the following:

data.mat <- read.csv("...")
groupids <- unique(data.mat$ID) #Assume there are then 100 unique groups

results <- matrix(rep("NA",300),ncol=3,nrow=100)

for(i in 1:100) {
  tempmat <- subset(data.mat, ID==groupids[i])

  # Run various stats on tempmat (correlations, regressions, etc.), checking to
  # make sure this specific group doesn't have NAs in the variables I'm using,
  # and assign the results to x, y, and z, for example.

  results[i,1] <- x
  results[i,2] <- y
  results[i,3] <- z
}

This ends up working for me, but depending on the size of the data and the number of groups I'm working with, this can take up to three days.

Besides branching out into parallel processing, is there any "trick" to make something like this run faster? For instance, converting the loop into something else (such as an apply call with a function containing the stats I want to run inside the loop), or eliminating the need to assign the subset of data to a variable at all?
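
To make that concrete, the kind of rewrite I have in mind would be something like the sketch below (untested; the Var1/Var2 columns and the mean/sd/cor calls are just placeholders for whatever x, y, and z actually are):

# Split the data once by ID instead of subsetting inside the loop,
# then apply a per-group function and bind the rows back together.
group.list <- split(data.mat, data.mat$ID)

per.group <- function(tempmat) {
  # placeholder stats standing in for x, y, and z above
  c(x = mean(tempmat$Var1, na.rm = TRUE),
    y = sd(tempmat$Var1, na.rm = TRUE),
    z = cor(tempmat$Var1, tempmat$Var2, use = "pairwise.complete.obs"))
}

results <- t(sapply(group.list, per.group))  # one row per group, columns x, y, z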

EDIT:

Maybe this is just common knowledge (or sampling error), but I tried subsetting with brackets in some of my code rather than using the subset command, and it seemed to provide a slight performance gain, which surprised me. Here is some code and output, using the same object names as above:

> system.time(for(i in 1:1000){ data.mat[data.mat$ID==groupids[i],] })
   user  system elapsed 
 361.41   92.62  458.32 

> system.time(for(i in 1:1000){ subset(data.mat, ID==groupids[i]) })
   user  system elapsed 
 378.44  102.03  485.94 

UPDATE:
In one of the answers, jorgusch suggested that I use the data.table package to speed up my subsetting. So, I applied it to a problem I ran earlier this week. In a dataset with a little over 1,500,000 rows and 4 columns (ID, Var1, Var2, Var3), I wanted to calculate two correlations within each group (indexed by the "ID" variable). There are slightly more than 50,000 groups. Below is my initial code (which is very similar to the above):

data.mat <- read.csv("//home....")
groupids <- unique(data.mat$ID)

results <- matrix(rep("NA",(length(groupids) * 3)),ncol=3,nrow=length(groupids))

for(i in 1:length(groupids)) {
  tempmat <- data.mat[data.mat$ID==groupids[i],]

  results[i,1] <- groupids[i]
  results[i,2] <- cor(tempmat$Var1,tempmat$Var2,use="pairwise.complete.obs")
  results[i,3] <- cor(tempmat$Var1,tempmat$Var3,use="pairwise.complete.obs")
}

I'm re-running that right now for an exact measure of how long that took, but from what I remember, I started it running when I got into the office in the morning and it finished sometime in mid-afternoon. Figure 5-7 hours.

Restructuring my code to use data.table....

library(data.table)

data.mat <- read.csv("//home....")
data.mat <- data.table(data.mat)

testfunc <- function(x,y,z) {
  temp1 <- cor(x,y,use="pairwise.complete.obs")
  temp2 <- cor(x,z,use="pairwise.complete.obs")
  list(temp1,temp2)
}

system.time(test <- data.mat[,testfunc(Var1,Var2,Var3),by="ID"])

   user  system elapsed 
  16.41    0.05   17.44 

Comparing the results from data.table to the ones I got from using a for loop to subset all IDs and record the results manually, they seem to have given me the same answers (though I'll have to check that a bit more thoroughly). That looks to be a pretty big speed increase.
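
(For that more thorough check, I'm planning something along these lines; just a sketch, assuming the loop's character matrix results converts cleanly to numeric columns:)

# Line both result sets up by ID and compare the correlation columns.
loop.res <- data.frame(ID = as.character(results[, 1]),
                       V1 = as.numeric(results[, 2]),
                       V2 = as.numeric(results[, 3]),
                       stringsAsFactors = FALSE)

dt.res <- as.data.frame(test)          # data.table names the two stats V1 and V2
dt.res$ID <- as.character(dt.res$ID)

merged <- merge(loop.res, dt.res, by = "ID", suffixes = c(".loop", ".dt"))
all.equal(merged$V1.loop, merged$V1.dt)
all.equal(merged$V2.loop, merged$V2.dt)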

UPDATE 2: Running the code using subsets finally finished up again:

   user     system   elapsed  
17575.79  4247.41   23477.00

UPDATE 3:
I wanted to see if anything worked out differently using the plyr package that was also recommended. This is my first time using it, so I may have done things somewhat inefficiently, but it still helped substantially compared to the for loop with subsetting.

Using the same variables and setup as before...

> library(plyr)
> data.mat <- read.csv("//home....")
> system.time(hmm <- ddply(data.mat, "ID", function(df)
+     c(cor(df$Var1, df$Var2, use="pairwise.complete.obs"),
+       cor(df$Var1, df$Var3, use="pairwise.complete.obs"))))

  user  system elapsed  
250.25    7.35  272.09  
+4  A: 

This is pretty much exactly what the plyr package is designed to make easier. However, it's unlikely that it will make things much faster - most of the time is probably spent doing the statistics.
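
A rough sketch of what that might look like for the loop in the question (the mean/sd/cor calls and the Var* columns are just placeholders for the actual statistics):

library(plyr)

# One call replaces the subset-and-assign loop: ddply splits data.mat by ID,
# applies the function to each piece, and stacks the results into a data frame.
results <- ddply(data.mat, "ID", function(df) {
  data.frame(x = mean(df$Var1, na.rm = TRUE),
             y = sd(df$Var1, na.rm = TRUE),
             z = cor(df$Var1, df$Var2, use = "pairwise.complete.obs"))
})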

hadley
Thanks for the response! That was my impression from skimming through the plyr manual a while back: it generally makes things simpler to write, but doesn't provide much in the performance realm. I'd probably use plyr, but I'm in a workgroup where most aren't very experienced with R, and the for loop, along with the explicitness of what's happening inside, seems to be pretty intuitive to read. Maybe I'll give plyr a try though and see if it provides any benefits. Every little bit helps :)
Adam
Also, according to Rprof output for some code I wrote in the above format, the vast majority of the time is spent subsetting the dataset.
Adam
Exactly how big is `data.mat`?
hadley
Talking of speed, data.table is much faster - see answer below.
+2  A: 

You have already suggested vectorizing and avoiding unnecessary copies of intermediate results, so you are certainly on the right track. Let me caution you not to do what I did and assume that vectorizing will give you a performance boost like it does in other languages (e.g., Python+NumPy, MATLAB).

An example:

# small function to time the results:
time_this = function(...) {
  start.time = Sys.time()
  eval(..., sys.frame(sys.parent(sys.parent())))
  end.time = Sys.time()
  print(end.time - start.time)
}

# data for testing: a 10000 x 1000 matrix of random doubles
a = matrix(rnorm(1e7, mean=5, sd=2), nrow=10000)

# two versions doing the same thing: calculating the mean for each row
# in the matrix
x = time_this( for (i in 1:nrow(a)){ mean( a[i,] ) } )
y = time_this( apply(X=a, MARGIN=1, FUN=mean) )

print(x)    # returns => 0.5312099
print(y)    # returns => 0.661242

The 'apply' version is actually slower than the 'for' version. (According to the author of The R Inferno, if you are doing this you are not vectorizing, you are 'loop hiding'.)

But where you can get a performance boost is by using built-ins. Below, I've timed the same operation as the two above, just using the built-in function 'rowMeans':

z = time_this(rowMeans(a))
print(z)    # returns => 0.03679609

An order of magnitude improvement versus the 'for' loop (and versus the 'apply' version).

The other members of the apply family are not just wrappers over a native 'for' loop, but that does not make them faster:

a = abs(floor(10*rnorm(1e6)))

time_this(sapply(a, sqrt))
# returns => 6.64 secs

time_this(for (i in 1:length(a)){ sqrt(a[i])})
# returns => 1.33 secs

'sapply' is about 5x slower compared with a 'for' loop.

Finally, w/r/t vectorized versus 'for' loops, I don't think I ever use a loop if I can use a vectorized function: the latter usually means fewer keystrokes and it's a more natural way (for me) to code, which is a different kind of performance boost, I suppose.

doug
Using apply is not vectorizing.
Shane
yep--will edit w/ example from lapply, sapply, etc.
doug
Forgive me if I misunderstood, but don't the results show that using sapply gives a 5x performance decrement rather than a boost? That's how it turned out when I ran your code fragments. In any case, thank you for the detailed example of the apply functions not always helping out. I don't think I have to worry about that in this case though, since unless I'm mistaken, the apply family doesn't apply (ha!) to the situation where you're subsetting chunks of the dataset and performing some sort of statistical operation on each chunk, because it's not going down a vector and doing the same thing to every item/row.
Adam
thanks--typo. edited answer.
doug
Doug, why the `time_this()` business when there is `system.time()`? And why not several reruns? I often do something like `mean(replicate(N, system.time(someRexpressionHere)["elapsed"]), trim=0.1)`, which drops the best and worst times and averages over the remainder. One could argue for summary() or median() or ... but the point is: one data point, as you show here, is not that informative, as you may get hit by function loading times etc. which are really one-offs.
Dirk Eddelbuettel
agreed, thanks Dirk.
doug
+2  A: 

Besides plyr, you can try the foreach package to get rid of the explicit loop counter, but I don't know if it will give you any performance benefit.

foreach nevertheless gives you a quite simple interface to parallel chunk processing if you have a multicore workstation (with the doMC/multicore packages; see "Getting Started with doMC and foreach" for details), in case you have ruled out parallel processing only because it is not very easy for students to understand. If that is not the only reason, plyr is a very good solution IMHO.
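
As a rough sketch for the correlation example above (assuming a multicore machine and the doMC backend; the core count is just an example):

library(foreach)
library(doMC)
registerDoMC(cores = 2)   # use however many cores your workstation has

groupids <- unique(data.mat$ID)

# %dopar% farms the groups out to the workers; .combine=rbind stacks the results
cors <- foreach(id = groupids, .combine = rbind) %dopar% {
  tempmat <- data.mat[data.mat$ID == id, ]
  c(cor(tempmat$Var1, tempmat$Var2, use = "pairwise.complete.obs"),
    cor(tempmat$Var1, tempmat$Var3, use = "pairwise.complete.obs"))
}

results <- data.frame(ID = groupids, cor12 = cors[, 1], cor13 = cors[, 2])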

zzr
+1  A: 

Personally, I don't find plyr very easy to understand. I prefer data.table, which is also faster. For instance, say you want to compute the standard deviation of the column my_column for each ID:

dt <- data.table(df)                       # one-time operation: convert df to a data.table
result.sd <- dt[,sd(my_column),by="ID"]    # result with each ID and its SD in the second column

Three statements of this kind and a cbind at the end - that is all you need (see the sketch at the end of this answer). You can also use dt to do some action for only one ID, without a subset command, in a new syntax:

result.sd.oneID <- dt[ID=="oneID", sd(my_column)]

The first argument refers to rows (i), the second to columns (j).

I find it easier to read than plyr, and it is more flexible, as you can also do sub-domains within a "subset"... The documentation describes it as using SQL-like methods; for instance, by is pretty much "group by" in SQL. If you know SQL you can probably do much more, but knowing SQL is not necessary to make use of the package. Finally, it is extremely fast, as each operation is not only parallel, but data.table also grabs only the data needed for the calculation. subset, by contrast, keeps the levels of the whole data frame and drags it through memory.
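
For instance, a sketch of the "three statements and a cbind" pattern applied to the question's data (the sd of Var1 is just a third illustrative statistic, and the column names are mine):

library(data.table)
dt <- data.table(data.mat)

res.cor12 <- dt[, list(cor12 = cor(Var1, Var2, use="pairwise.complete.obs")), by="ID"]
res.cor13 <- dt[, list(cor13 = cor(Var1, Var3, use="pairwise.complete.obs")), by="ID"]
res.sd1   <- dt[, list(sd1   = sd(Var1, na.rm=TRUE)),                         by="ID"]

# the groups come back in the same order from each call, so a cbind lines them up
results <- cbind(res.cor12, cor13 = res.cor13$cor13, sd1 = res.sd1$sd1)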