views:

241

answers:

3

I have a data.frame from this code:

   my_df = data.frame("read_time" = c("2010-02-15", "2010-02-15", 
                                      "2010-02-16", "2010-02-16", 
                                       "2010-02-16", "2010-02-17"), 
                      "OD" = c(0.1, 0.2, 0.1, 0.2, 0.4, 0.5) )

which produces this:

> my_df
   read_time  OD
1 2010-02-15 0.1
2 2010-02-15 0.2
3 2010-02-16 0.1
4 2010-02-16 0.2
5 2010-02-16 0.4
6 2010-02-17 0.5

I want to average the OD column over each distinct read_time (notice some are replicated others are not) and I also would like to calculate the standard deviation, producing a table like this:

> my_df
   read_time  OD        stdev
1 2010-02-15 0.15       0.05
5 2010-02-16 0.3         0.1
6 2010-02-17 0.5         0

Which are the best functions to deal with concatenating such values in a data.frame?

+6  A: 

The plyr package is popular for this, but the base functions by() and aggregate() will also help.

> ddply(my_df, "read_time", function(X) data.frame(OD=mean(X$OD),stdev=sd(X$OD)))
   read_time      OD   stdev
1 2010-02-15 0.15000 0.07071
2 2010-02-16 0.23333 0.15275
3 2010-02-17 0.50000      NA

You can add the missing bit to return 0 instead of NA for the last std.dev.

Also, you don't need the quotes (on the variables) you had in the data.frame construction.

Dirk Eddelbuettel
Thanks Dirk, that works well with the plyr package, but could you tell me how to add another column to the data, I have another column named day in my real dataset. I tried this: ddply(individual_well_series_od, "read_time", function(X) data.frame(od=mean(X$od),stdev=sd(X$od), day=X$day)) but it returns a all of the read_times again. I realize that I'm not applying a function to day, but I read the help but can't see where to put it.
John
And I tried the original code without the "" around the dates, but the dates did not read correctly, so I kept them in, I could have tried to convert them to date objects I presume, but I kept it as this simple example.
John
Not the dates, the variables, ie use data.frame(a=1:3) and not data.frame("a"=1:3)As for adding a variable, you can't -- the ddply call *reduces* several rowns to a single-row summary. If you add an original data column then you get repeats. You got to think this through.
Dirk Eddelbuettel
Or with the built in `summarize` helper function: `ddply(my_df, "read_time", summarise, OD = mean(OD), stdev = sd(OD))`
hadley
summarize (with z) or summarise (with s) or both? ;-) R is quite charming in its support of British and American spelling...
Dirk Eddelbuettel
thanks, it works nicely with summarize too. I also fixed the new variables by including e.g. days = unique(days)
John
But unique(days) could return more than one. You could try head(days,1) or even head(unique(days), 1).
Dirk Eddelbuettel
+1  A: 

This illustrates how you could use aggregate to get the mean and standard deviation by your read_time.

>aggregate(my_df$OD, by=list(my_df$read_time), function(x) mean(x))

     Group.1         x
1 2010-02-15 0.1500000
2 2010-02-16 0.2333333
3 2010-02-17 0.5000000


>aggregate(my_df$OD, by=list(my_df$read_time), function(x) sd(x))
     Group.1          x
1 2010-02-15 0.07071068
2 2010-02-16 0.15275252
3 2010-02-17         NA
andrewj
If you just want an existing function called, you don't have to define your own anonymous function. You can pass the existing function:aggregate(my_df$OD,by=list(my_df$read_time),mean)
Jyotirmoy Bhattacharya
+1  A: 

You can try the package data.table. If you know MySQL it should be very easy for you to get all the functions, otherwise the basics are good enough too ;-)

my_dfdt<-data.table(my_df)
mean<-my_dfdt[,mean(OD), by="read_time"]
sd<-  ..  

you can also join both in one line or to cbind at the end, your call of style

Another advantage: it is extremely fast, if you have large samples. Very fast...see documentation why.