tags:

views:

335

answers:

2

I've learned R by toying, and I'm starting to think that I'm abusing the tapply function. Are there better ways to do some of the following actions? Granted, they work, but as they get more complex I wonder if I'm losing out on better options. I'm looking for some criticism, here:

tapply(var1, list(fac1, fac2), mean, na.rm=T)

tapply(var1, fac1, sum, na.rm=T) / tapply(var2, fac1, sum, na.rm=T)

cumsum(tapply(var1, fac1, sum, na.rm=T)) / sum(var1)

Update: Here's some example data...

     var1    var2 fac1           fac2
1      NA  275.54   10      (266,326]
2      NA  565.89   10      (552,818]
3      NA  815.41    6      (552,818]
4      NA  281.77    6      (266,326]
5      NA  640.24   NA      (552,818]
6      NA   78.42   NA     [78.4,266]
7      NA 1027.06   NA (818,1.55e+03]
8      NA  355.20   NA      (326,552]
9      NA  464.52   NA      (326,552]
10     NA 1397.11   10 (818,1.55e+03]
11     NA  229.82   NA     [78.4,266]
12     NA  542.77   NA      (326,552]
13     NA  829.32   NA (818,1.55e+03]
14     NA  284.78   NA      (266,326]
15     NA  194.97   10     [78.4,266]
16     NA  672.55    8      (552,818]
17     NA  348.01   10      (326,552]
18     NA 1550.79    9 (818,1.55e+03]
19 101.98  101.98    4     [78.4,266]
20     NA  292.80    6      (266,326]

Update data dump:

structure(list(var1 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 101.98, NA), var2 = c(275.54, 565.89, 815.41, 281.77, 640.24, 78.42, 1027.06, 355.2, 464.52, 1397.11, 229.82, 542.77, 829.32, 284.78, 194.97, 672.55, 348.01, 1550.79, 101.98, 292.8), fac1 = c(10L, 10L, 6L, 6L, NA, NA, NA, NA, NA, 10L, NA, NA, NA, NA, 10L, 8L, 10L, 9L, 4L, 6L), fac2 = structure(c(2L, 4L, 4L, 2L, 4L, 1L, 5L, 3L, 3L, 5L, 1L, 3L, 5L, 2L, 1L, 4L, 3L, 5L, 1L, 2L), .Label = c("[78.4,266]", "(266,326]", "(326,552]", "(552,818]", "(818,1.55e+03]"), class = "factor")), .Names = c("var1", "var2", "fac1", "fac2"), row.names = c(NA, -20L), class = "data.frame")

+6  A: 

Perhaps the plyr package might be of interest to you?

Steve Lianoglou
Brilliant. It's like ETL for data analysis itself. This also goes a long way to defeating the pivot table data dredging my colleagues are so fond of.Thanks for the suggestion. I'm already rewriting code.
Totovader
A: 

For part 1 I prefer aggregate because it keeps the data in a more R-like one observation per row format.

aggregate(var1, list(fac1, fac2), mean, na.rm=T)

Peter