In R, I'm looking for a memory-efficient way to create a frequency summary of tabular data. Take for example the data.frame foo below, which I've summarized with table() and then converted with as.data.frame() to obtain the frequency counts.
foo <- data.frame(x = c('a', 'a', 'a', 'b', 'b', 'b'),
                  y = c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'))
bar <- as.data.frame(table(foo), stringsAsFactors = FALSE)
This results in the following frequency counts for bar:
x y Freq
1 a ab 1
2 b ab 0
3 a ac 1
4 b ac 0
5 a ad 1
6 b ad 0
7 a ae 0
8 b ae 1
9 a fx 0
10 b fx 1
11 a fy 0
12 b fy 1
The problem I'm running into is that when x and y have many levels, this approach starts using significant amounts of memory (more than 64 GB), because table() produces a count for every combination of levels, including the combinations that never occur in the data (the Freq = 0 rows above). I was wondering if there is an alternative way of doing this kind of frequency count. As a first step, I set stringsAsFactors = FALSE, but that doesn't completely solve the problem.
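For what it's worth, the result I'm after only needs the combinations that actually occur in the data. As a rough sketch of that sparser shape (I'm assuming this isn't necessarily the most memory-efficient option either, and baz is just an illustrative name), something like aggregate() in base R would do it:

# Sketch: count only the observed (x, y) combinations, so no zero rows
# are created for pairs of levels that never appear together.
baz <- aggregate(Freq ~ x + y, data = transform(foo, Freq = 1L), FUN = sum)

For the toy foo above, that would give just the six observed rows, each with Freq = 1, rather than the twelve-row cross-product shown earlier.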