tags:

views:

233

answers:

3

I am trying to reshape a data frame in R and it seems to have problems using the recommended ways of doing so. The data frame has the following structure:

ID                     DATE1             DATE2            VALTYPE        VALUE
'abcd1233'         2009-11-12        2009-12-23           'TYPE1'        123.45
...

VALTYPE is a string and is a factor with only 2 values (say TYPE1 and TYPE2). I need to transform it into the following data frame ("wide" transpose) based on common ID and DATEs:

ID                     DATE1             DATE2            VALUE.TYPE1  VALUE.TYPE2
'abcd1233'             2009-11-12        2009-12-23       123.45           NA
...

The data frame has more than 4,500,000 observations (although about 70% of VALUEs are NA). The machine is an Intel-based Linux workstation with 4Gb of RAM. Loading the data (from a compressed Rdata file) into a fresh R process makes it grow to about 250Mb which clearly leaves a lot of space for reshaping.

These are my experiences so far:

  • Using vanilla reshape() method:

    tbl2 <- reshape(tbl, direction = "wide", idvar = c("ID", "DATE1", "DATE2"), timevar = "VALTYPE");

RESULT: Error: cannot allocate vector of size 4.8 Gb

  • Using cast() method of reshape package:

    tbl2 <- cast(tbl, ID + DATE1 + DATE2 ~ VALTYPE);

RESULT: R process consumes all RAM with no end in sight. Had to kill the process eventually.

  • Using by() and merge():

    sp <- by(tbl[c(1,2,3,5)], tbl$VALTYPE, function(x) x); tbl <- merge(sp[["TYPE1"]], sp[["TYPE2"]], by = c("ID", "DATE1", "DATE2"), all = TRUE, sort = TRUE);

RESULT: works fine, although this is not very elegant and foolproof (i.e. it will break if more types are added).

To add insult to injury, the operation in question can be trivially achieved in about 3 lines of AWK or Perl (and with hardly any RAM used). So the question is: what is a better way to do this operation in R using recommended methods without consuming all available RAM?

+1  A: 

A useful trick is to combine the id variables into a character vector and then do the reshape.

tbl$NEWID <- with(tbl, paste(ID, DATE1, DATE2, sep=";"))
tbl2 <- recast(tbl2, NEWID ~ VALTYPE, measure.var="VALUE")

It's about 40% faster in a problem of similar size in my intel core2 duo 2.2ghz macbook.

Eduardo Leoni
Nope, using `recast()` shows same problem as with `cast()` method above - the process went over 5 Gb of virtual memory so I killed it after about 1 hour.
Alexander L. Belikoff
+1  A: 

What about doing this in a non-R-like manner? I assume you have a TYPE1 and a TYPE2 row for each value of ID,DATE1,DATE2? Then sort the dataframe by those variables, and write a big for loop. You can repeatedly do rbind() operations to build the table, or you could try to pre-allocate the table (maybe) and just assign the VALUE.TYPE1 and VALUE.TYPE2 slots with [<-, which should do the assignment in-place.

(Note that if you're using rbind(), I believe that it's inefficient if you have any factor variables, so make sure everything is a character instead!)

Harlan
Without `rbind` and loops: `tbl <- tbl[with(tbl,order(ID,DATE1,DATE2,VALTYPE)),];tbl_out <- tbl[seq(1,nrow(tbl),by=2),-4];names(tbl_out)[4] <- "VALUE.TYPE1";tbl_out$VALUE.TYPE2 <- tbl$VALUE[seq(2,nrow(tbl),by=2)];`
Marek
We can't really assume that there are exactly 2 entries for each ID/DATEs. This will immediately break Marek's code above. Also, it would be even more fragile than my working `by()/merge()` code in the body of the question.Overall, I don't have any problem with the loop approach except that I don't understand why the function specifically for that purpose (that is `reshape()` fails on such a trivial problem)
Alexander L. Belikoff
+1  A: 

Maybe you could use the cat() function?

Karsten W.