tags:

views:

106

answers:

2

I would like to use the data.table package in R to dynamically generate aggregations, but I am running into an error. Below, let my.dt be of type data.table.

sex <- c("M","F","M","F")
age <- c(19, 23, 26, 21)
dependent.variable <- c(1400, 1500, 1250, 1100)
my.dt <- data.table(sex, age, dependent.variable)
grouping.vars <- c("sex", "age")
for (i in 1:2) {
     my.dt[,sum(dependent.variable), by=grouping.vars[i]]
}

If I run this, I get errors:

Error in `[.data.table`(my.dt, , sum(dependent.variable), by = grouping.vars[i] :
  by must evaluate to list

Yet the following works without error:

my.dt[,sum(dependent.variable), by=sex]

I see why the error is occurring, but I do not see how to use a vector with the by parameter.

+2  A: 

Hi, Ryan, I did two changes to your original code:

sex <- c("M","F","M","F")
age <- c(19, 23, 26, 21) 

age<-as.factor(age)

dependent.variable <- c(1400, 1500, 1250, 1100)
my.dt <- data.table(sex, age, dependent.variable)

for ( a in 1:2){
print(my.dt[,sum(dependent.variable), by=list(sex,age)[a]]) 
}

Numerical vector age should be forced into factors. As to by parameter, do not use quote for column names but group them into list(...). At least this is what the author has suggested.

Vulpecula
That's interesting, and the design behind the "by" seems a bit counterintuitiv, but your code works mostly.Only one small problem (minor, and easy to fix by just assigning names in the loop), the column names are not updated.For variable sex, names are: sex V1For variable age, names are: sex V1
Ryan Rosario
+1  A: 

The answer above looks almost right. Strictly speaking the by needs to evaluate to a list of vectors each with storage mode integer, though. So the numeric vector age could also be coerced to integer using as.integer(). This is because data.table uses radix sorting (very fast) but the radix algorithm is specifically for integers only (see wikipedia's entry for 'radix sort'). Integer storage for key columns and ad hoc by is one of the reasons data.table is fast. A factor is of course an integer lookup to unique strings.

The idea behind by being a list() of expressions is that you are not restricted to column names. It is usual to write expressions of column names directly in the by. A common one is to aggregate by month; for example :

DT[,sum(col1), by=list(region,month(datecol))]

or a very fast way to group by yearmonth is by using a non epoch based date, such as yyyymmddL as seen in some of the examples in the package, like this :

DT[,sum(col1), by=list(region,month=datecol%/%100L)]

Notice how you can name the columns inside the list() like that.

To define and reuse complex grouping expressions :

e = quote(list(region,month(datecol)))
DT[,sum(col1),by=eval(e)]
DT[,sum(col2*col3/col4),by=eval(e)]

Or if you don't want to re-evaluate the by expressions each time, you can save the result once and reuse the result for efficiency; if the by expressions themselves take a long time to calculate/allocate, or you need to reuse it many times :

byval = DT[,list(region,month(datecol))]
DT[,sum(col1),by=byval]
DT[,sum(col2*col3/col4),by=byval]

Please see http://datatable.r-forge.r-project.org/ for latest info and status. A new presentation will be up there soon and hoping to release v1.5 to CRAN soon too. This contains several bug fixes and new features detailed in the NEWS file. The datatable-help list has about 30-40 posts a month which may be of interest too.