views: 222
answers: 3
In R, I'm looking for a memory-efficient way to create a summary of tabular data as follows.

Take, for example, the data.frame foo, which I've summarized with table() and then passed to as.data.frame() to obtain the frequency counts.

foo <- data.frame(x = c('a', 'a', 'a', 'b', 'b', 'b'), y = c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'))
bar <- as.data.frame(table(foo), stringsAsFactors = FALSE)

This results in the following frequency counts for bar:

   x  y Freq
1  a ab    1
2  b ab    0
3  a ac    1
4  b ac    0
5  a ad    1
6  b ad    0
7  a ae    0
8  b ae    1
9  a fx    0
10 b fx    1
11 a fy    0
12 b fy    1

The problem I'm running into is that when x and y have many levels, this approach starts using significant amounts of memory (more than 64 GB): the dense result has one row for every combination of levels, so it grows as nlevels(x) * nlevels(y) even when most combinations never occur. I was wondering whether there is an alternative way of doing this kind of frequency count. As a first step I set stringsAsFactors = FALSE, but that doesn't completely solve the problem.

+1  A: 

Look at xtabs with sparse = TRUE, which uses the Matrix package to do sparse cross-tabulation.
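
A minimal sketch on the example data (assuming foo as defined in the question; sparse = TRUE is an argument of stats::xtabs and requires the Matrix package to be installed):

library(Matrix)
xt <- xtabs(~ x + y, data = foo, sparse = TRUE)
summary(xt)  # triplet form: one row per non-zero cell, zero cells never materialized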

Jonathan Chang
Thanks for the suggestion. However, I think `xtabs` only works with two-ways tables. It turns out that in my specific case, I actually have a table that's three-ways.
andrewj
+1  A: 
library(plyr)
ddply(foo, ~ x + y, nrow, .drop = FALSE)
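
With .drop = FALSE this reproduces the dense table from the question, zero rows included. If the zero cells aren't needed, leaving .drop at its default of TRUE returns only the combinations that actually occur, which is the memory-saving variant:

ddply(foo, ~ x + y, nrow)  # observed (x, y) combinations only, no zero rows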
Ian Fellows
+1  A: 

I have this method for fast (sparse) cross tabulation. I think there are possibilities for further optimisation, but it's been good enough for me for large data sets. The key is the use of ninteraction from the plyr package to quickly generate a numeric id for each row.

tab <- function(df, drop = TRUE) {
  # Assign each row a single numeric id encoding its combination of values
  id <- plyr::ninteraction(df)
  ord <- order(id)

  # Sort the rows by id so that identical combinations are adjacent
  df <- df[ord, , drop = FALSE]
  id <- id[ord]

  # The run lengths of the sorted ids are the frequency counts
  freq <- rle(id)$lengths
  # Take one representative row per run to label each count
  labels <- plyr::unrowname(df[cumsum(freq), , drop = FALSE])

  data.frame(labels, freq)
}
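
A quick check on the example data (one caveat, flagged as an assumption: later plyr releases appear to have renamed ninteraction to id, so plyr::id(df) may be needed instead of plyr::ninteraction(df)):

tab(foo)  # six rows, one per observed (x, y) combination, each with freq 1;
          # unobserved combinations are simply absent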
hadley