I am trying to reshape a data frame in R and am running into problems with the recommended ways of doing so. The data frame has the following structure:
    ID         DATE1       DATE2       VALTYPE  VALUE
    'abcd1233' 2009-11-12  2009-12-23  'TYPE1'  123.45
    ...
VALTYPE is a string stored as a factor with only two levels (say TYPE1 and TYPE2). I need to transform it into the following data frame (a "wide" transpose) based on the common ID and date columns:
    ID         DATE1       DATE2       VALUE.TYPE1  VALUE.TYPE2
    'abcd1233' 2009-11-12  2009-12-23  123.45       NA
    ...
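For reference, here is a minimal toy version of the input that I believe captures the structure (the column names and types are as described above; the second ID is made up):

    ## Toy input with the assumed column names/types; the real table is far larger.
    tbl <- data.frame(ID      = c("abcd1233", "efgh5678", "efgh5678"),  # 'efgh5678' is a made-up ID
                      DATE1   = as.Date(c("2009-11-12", "2009-10-01", "2009-10-01")),
                      DATE2   = as.Date(c("2009-12-23", "2009-10-15", "2009-10-15")),
                      VALTYPE = factor(c("TYPE1", "TYPE1", "TYPE2")),
                      VALUE   = c(123.45, 10.00, 20.00))
    ## Desired result: one row per (ID, DATE1, DATE2) with VALUE.TYPE1 and VALUE.TYPE2
    ## columns, and NA wherever a type is missing for that key.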
The data frame has more than 4,500,000 observations (although about 70% of the VALUEs are NA). The machine is an Intel-based Linux workstation with 4 GB of RAM. Loading the data (from a compressed Rdata file) into a fresh R process makes the process grow to about 250 MB, which clearly leaves plenty of room for reshaping.
These are my experiences so far:
1. Using the vanilla reshape() method:

       tbl2 <- reshape(tbl, direction = "wide",
                       idvar = c("ID", "DATE1", "DATE2"), timevar = "VALTYPE")

   RESULT: Error: cannot allocate vector of size 4.8 Gb
2. Using the cast() method from the reshape package:

       tbl2 <- cast(tbl, ID + DATE1 + DATE2 ~ VALTYPE)

   RESULT: the R process consumes all RAM with no end in sight. I eventually had to kill the process.
3. Using by() and merge():

       sp <- by(tbl[c(1, 2, 3, 5)], tbl$VALTYPE, function(x) x)
       tbl <- merge(sp[["TYPE1"]], sp[["TYPE2"]],
                    by = c("ID", "DATE1", "DATE2"), all = TRUE, sort = TRUE)

   RESULT: works fine, although it is neither elegant nor foolproof (i.e. it will break if more types are added); a rough generalization is sketched after this list.
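For completeness, here is a rough sketch of how I imagine the by()/merge() approach could be made type-agnostic with split() and Reduce(); I have not tried this on the full table, and the column names are the ones assumed above:

    ## Sketch: same idea as approach 3, but for any number of VALTYPE levels (untested at full scale).
    pieces <- split(tbl[c("ID", "DATE1", "DATE2", "VALUE")], tbl$VALTYPE)
    ## Rename VALUE to VALUE.<type> in each piece so the merged columns do not clash.
    pieces <- lapply(names(pieces), function(type) {
      p <- pieces[[type]]
      names(p)[names(p) == "VALUE"] <- paste("VALUE", type, sep = ".")
      p
    })
    ## Full outer join of all the pieces on the key columns.
    wide <- Reduce(function(x, y)
      merge(x, y, by = c("ID", "DATE1", "DATE2"), all = TRUE, sort = TRUE),
      pieces)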
To add insult to injury, the operation in question can be trivially achieved in about 3 lines of AWK or Perl (and with hardly any RAM used). So the question is: what is a better way to do this operation in R using recommended methods without consuming all available RAM?
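And just to be explicit about what the AWK/Perl version would do: it is essentially a single pass of key lookups. My untested guess at a base-R analogue (column names as above) looks roughly like this, although I doubt it counts as a "recommended" way either:

    ## Rough base-R analogue of the single-pass key-lookup idea (untested at full scale).
    keys <- unique(tbl[c("ID", "DATE1", "DATE2")])
    key.str <- function(d) paste(d$ID, d$DATE1, d$DATE2, sep = "\r")  # composite key as a string
    wide <- keys
    for (type in levels(tbl$VALTYPE)) {
      sub <- tbl[tbl$VALTYPE == type, ]
      ## match() gives NA where a key has no row of this type, which yields NA in the result.
      wide[[paste("VALUE", type, sep = ".")]] <-
        sub$VALUE[match(key.str(keys), key.str(sub))]
    }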