The answer above looks almost right. Strictly speaking, though, the by needs to evaluate to a list of vectors, each with storage mode integer, so the numeric vector age could also be coerced to integer using as.integer(). This is because data.table uses radix sorting, which is very fast, but the radix algorithm is specifically for integers only (see Wikipedia's entry for 'radix sort'). Integer storage for key columns and ad hoc by is one of the reasons data.table is fast. A factor is, of course, an integer lookup to unique strings.
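To make the storage-mode point concrete, here is a minimal base R sketch. The values of age and sex are made up for illustration and do not come from the original question.

```r
# Hypothetical values, for illustration only.
age <- c(23, 35, 23)
storage.mode(age)            # a plain numeric vector is stored as double

age_int <- as.integer(age)   # coerce so it can be used as an integer grouping column
storage.mode(age_int)

# A factor is stored as integer codes plus a lookup of unique strings.
sex <- factor(c("M", "F", "M"))
storage.mode(sex)            # storage mode of the underlying codes
as.integer(sex)              # the codes themselves
levels(sex)                  # the unique strings the codes point to
```

Note that as.integer() truncates any fractional part, so it is only a safe coercion when the numeric column holds whole numbers.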
The idea behind by being a list() of expressions is that you are not restricted to column names. It is usual to write expressions of column names directly in the by. A common one is to aggregate by month, for example:
DT[,sum(col1), by=list(region,month(datecol))]
Or a very fast way to group by year and month is to use a non-epoch-based integer date, such as yyyymmddL as seen in some of the examples in the package, like this:
DT[,sum(col1), by=list(region,month=datecol%/%100L)]
Notice how you can name the columns inside the list() like that.
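The integer division trick can be checked in plain R without data.table. This is a small sketch with made-up dates stored as yyyymmdd integers: dividing by 100L drops the day and leaves an integer yyyymm value to group on.

```r
# Hypothetical dates stored as yyyymmdd integers (note the L suffix).
datecol <- c(20100115L, 20100131L, 20100201L)

# Integer division by 100L strips the two day digits, leaving yyyymm.
yearmonth <- datecol %/% 100L
yearmonth

# Both operands are integers, so the result stays integer.
is.integer(yearmonth)
```

Because the result is already an integer, no further coercion is needed before using it in by.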
To define and reuse complex grouping expressions:
e = quote(list(region,month(datecol)))
DT[,sum(col1),by=eval(e)]
DT[,sum(col2*col3/col4),by=eval(e)]
Or, if the by expressions themselves take a long time to calculate or allocate, or you need to reuse them many times, you can evaluate them once, save the result and reuse it for efficiency:
byval = DT[,list(region,month(datecol))]
DT[,sum(col1),by=byval]
DT[,sum(col2*col3/col4),by=byval]
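For anyone who wants to try the quote()/eval() pattern end to end, here is a small self-contained sketch. The table contents and the column col2 are invented for illustration, and it assumes a data.table version that provides month() for Date columns; behaviour may differ in older releases.

```r
library(data.table)

# Hypothetical data, for illustration only.
DT <- data.table(
  region  = c("east", "east", "west", "west"),
  datecol = as.Date(c("2010-01-15", "2010-02-01", "2010-01-20", "2010-01-25")),
  col1    = c(1L, 2L, 3L, 4L),
  col2    = c(0.5, 1.5, 2.0, 2.5)
)

# Define the grouping expression once, unevaluated ...
e <- quote(list(region, month(datecol)))

# ... and reuse it across several aggregations.
res1 <- DT[, sum(col1), by = eval(e)]
res2 <- DT[, sum(col2), by = eval(e)]
res1
res2
```

Keeping the grouping in one quoted expression means a change to the grouping only has to be made in one place.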
Please see http://datatable.r-forge.r-project.org/ for the latest info and status. A new presentation will be up there soon, and I am hoping to release v1.5 to CRAN soon too. It contains several bug fixes and new features detailed in the NEWS file. The datatable-help list has about 30-40 posts a month, which may be of interest too.