Is there a way that this can be improved, or done more simply?

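means.by <- function(data, INDEX) {
  # Column means for each group of rows defined by INDEX
  b <- by(data, INDEX, function(d) apply(d, 2, mean))
  # Stack the per-group mean vectors into a groups-by-columns matrix
  structure(
    t(matrix(unlist(b), nrow = length(b[[1]]))),
    dimnames = list(names(b), names(b[[1]]))
  )
}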

The idea is the same as a SAS MEANS BY statement. The function 'means.by' takes a data.frame and an indexing variable, computes the mean of each column of the data.frame over each set of rows corresponding to a unique value of INDEX, and returns the result as a matrix whose row names are the unique values of INDEX.

I'm sure there must be a better way to do this in R but I couldn't think of anything.

+1  A: 

You want tapply or ave, depending on how you want your output:

> Data <- data.frame(grp=sample(letters[1:3],20,TRUE),x=rnorm(20))
> ave(Data$x, Data$grp)
 [1] -0.3258590 -0.5009832 -0.5009832 -0.2136670 -0.3258590 -0.5009832
 [7] -0.3258590 -0.2136670 -0.3258590 -0.2136670 -0.3258590 -0.3258590
[13] -0.3258590 -0.5009832 -0.2136670 -0.5009832 -0.3258590 -0.2136670
[19] -0.5009832 -0.2136670
> tapply(Data$x, Data$grp, mean)
         a          b          c 
-0.5009832 -0.2136670 -0.3258590 

# Example with more than one column:
> Data <- data.frame(grp=sample(letters[1:3],20,TRUE),x=rnorm(20),y=runif(20))
> do.call(rbind,lapply(split(Data[,-1], Data[,1]), colMeans))
             x         y
a -0.675195494 0.4772696
b  0.270891403 0.5091359
c  0.002756666 0.4053922
Joshua Ulrich
Neither of those will do what I want, and they are essentially the same thing. In fact, the function 'by', which I am using, is simply a wrapper for tapply. The idea is that I give it a data.frame, apply a function over the columns, and get a data.frame or matrix back.
Andrew Redd
My bad. My example only has one column.
Joshua Ulrich
+2  A: 

Does the aggregate function do what you want?

If not, look at the plyr package, it gives several options for taking things apart, doing computations on the pieces, then putting it back together again.

You may also be able to do this using the reshape package.
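As a minimal sketch of the aggregate suggestion (the data frame and its column names I, x and y are made up here for illustration), something along these lines should give one row of column means per group:

```r
# Hypothetical example data: one grouping factor and two numeric columns
set.seed(1)
data <- data.frame(I = factor(rep(letters[1:3], each = 4)),
                   x = rnorm(12),
                   y = runif(12))

# Mean of every non-grouping column within each level of I
res <- aggregate(data[, -1], by = list(I = data$I), FUN = mean)

# The formula interface expresses the same computation
res2 <- aggregate(. ~ I, data = data, FUN = mean)

print(res)
```

The result is a data.frame with the grouping variable as its first column, and nothing has to be written out per column, so it should carry over to wider data frames.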

Greg Snow
Yes, aggregate was what I was looking for. Thank you.
Andrew Redd
+2  A: 

With plyr

library(plyr)
df <- ddply(x, .(id), function(x) data.frame(
  mean = mean(x$var)
))
print(df)

Update:

data<-data.frame(I=as.factor(rep(letters[1:10],each=3)),x=rnorm(30),y=rbinom(30,5,.5))
ddply(data,.(I), function(x) data.frame(x=mean(x$x), y=mean(x$y)))

See, plyr is smart :)

Update 2:

In response to your comment, I believe cast and melt from the reshape package are much simpler for your purpose.

library(reshape)
cast(melt(data, id.vars = "I"), I ~ variable, mean)
Brandon Bertelsen
Can this scale to a data.frame with 100 columns? Writing data.frame(x=mean(x$x),...) is not practical. I don't mean to be negative or derogatory, but that is the context of my situation, so I am looking for the best solution that scales up well.
Andrew Redd
The answer is yes, you have a whole function to work with inside of ddply. However, I think cast and melt are more efficient for this purpose. I have updated my response.
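One way this might look for a wide data.frame, as a rough sketch: plyr's numcolwise() turns mean into a function that is applied to every numeric column, so no column has to be named by hand. The data frame wide below, with 100 numeric columns, is invented purely for illustration.

```r
library(plyr)

# Hypothetical wide data: grouping factor I plus 100 numeric columns (X1..X100)
set.seed(42)
wide <- data.frame(I = factor(rep(letters[1:5], each = 4)),
                   matrix(rnorm(20 * 100), nrow = 20))

# numcolwise(mean) applies mean to each numeric column of every group
res <- ddply(wide, .(I), numcolwise(mean))

dim(res)
```

The result keeps I as the first column, followed by one column of group means per numeric input column.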
Brandon Bertelsen