Friends

I'm trying to set up a matrix or data.frame for a canonical correlation analysis. The original dataset has one column designating one of several conditions, followed by columns of explanatory variables. I need an array that holds a 0/1 indicator variable for each condition. E.g., the columns in df are:

ID, cond, task1, taskN  
A, x, 12, 14  
B, x, 13, 17  
C, y, 11, 10  
D, z, 10, 13  

Here "cond" can be x, y, z, ... (the number of levels varies, so I don't know it in advance). This needs to become:

ID, x, y, z, task1, taskN  
A, 1, 0, 0, 12, 14  
B, 1, 0, 0, 13, 17  
C, 0, 1, 0, 11, 10  
D, 0, 0, 1, 10, 13  

So, I can set up the indicators in an array

iv <- as.data.frame(array(, c(nrow(df), length(levels(df$cond)))))  

and then cbind this to df, but I can't figure out how to go into the array and set the appropriate indicator to "1" and the rest to "0".

Any suggestions?

Thanks

Jon

+3  A: 

If you code cond as a factor, you can get R to do the expansion you want via model.matrix. The only complication is that, to get the coding you chose (one dummy variable per level), I change the default contrasts used by R's model formula code to contr.sum.

## data
dat <- data.frame(ID = LETTERS[1:4], cond = factor(c("x","x","y","z")),
                  task1 = c(12,13,11,10), taskN = c(14,17,10,13))
dat

## We get R to produce the dummy variables for us,
## but your coding needs the contr.sum contrasts
op <- options(contrasts = c("contr.sum","contr.poly"))
dat2 <- data.frame(ID = dat$ID, model.matrix(ID ~ . - 1, data = dat))
## Levels of cond
lev <- with(dat, levels(cond))
## fix-up the names
names(dat2)[2:(1+length(lev))] <- lev
dat2

## reset contrasts
options(op)

This gives us:

> dat2
  ID x y z task1 taskN
1  A 1 0 0    12    14
2  B 1 0 0    13    17
3  C 0 1 0    11    10
4  D 0 0 1    10    13

This should scale automatically as the number of levels in cond changes/increases.
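
For comparison, the contrasts change may not be strictly needed: as far as I know, when a formula has no intercept, model.matrix codes the first factor term with one indicator column per level regardless of the contrasts option. A minimal sketch on the same toy data, assuming the default contrasts; the name `dat3` and the `sub()` call that strips the `cond` prefix are my own additions:

```r
dat <- data.frame(ID = LETTERS[1:4], cond = factor(c("x","x","y","z")),
                  task1 = c(12,13,11,10), taskN = c(14,17,10,13))

## No intercept (- 1), so every level of cond gets its own 0/1 column
iv <- model.matrix(~ cond - 1, data = dat)
## model.matrix names the columns condx, condy, ...; strip the prefix
colnames(iv) <- sub("^cond", "", colnames(iv))

## Reassemble: ID, indicator columns, then the task columns
dat3 <- data.frame(ID = dat$ID, iv, dat[c("task1", "taskN")])
dat3
```

Because the indicator columns are built from `levels(cond)`, this sketch should also follow along as the number of levels changes.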

HTH

Gavin Simpson
+2  A: 

Another alternative is to use cast in the reshape package:

library(reshape)
l <- length(levels(dat$cond))
dat2 <- merge(cast(dat,ID~cond),dat)[,c(1:(l+1),(l+3):(ncol(dat)+l))]
dat2[,2:(1+l)] <- !is.na(dat2[,2:(1+l)])

This gives you logical values rather than 0 and 1 though:

> dat2
  ID     x     y     z task1 taskN
1  A  TRUE FALSE FALSE    12    14
2  B  TRUE FALSE FALSE    13    17
3  C FALSE  TRUE FALSE    11    10
4  D FALSE FALSE  TRUE    10    13
James
If you make your last line `dat2[,2:(1+l)] <- as.numeric(!is.na(dat2[,2:(1+l)]))` then you'll get the result the OP wanted.
Gavin Simpson
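
The logical-to-numeric conversion can also be done one column at a time, which avoids flattening the columns into a single vector. A small sketch in base R on a made-up data frame (the names `lg` and `cols` are hypothetical, not from the answer):

```r
## Toy data frame with logical indicator columns
lg <- data.frame(ID = c("A", "B", "C"),
                 x = c(TRUE, TRUE, FALSE),
                 y = c(FALSE, FALSE, TRUE))

## Convert each logical column to 0/1, leaving ID untouched
cols <- c("x", "y")
lg[cols] <- lapply(lg[cols], as.numeric)
lg
```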
+1  A: 

That's cool using model.matrix for this. (reshape too.) Always learning something here. A couple more ideas:

indicator1 <- function(groupStrings) {
  groupFactors <- factor(groupStrings)
  colNames <- levels(groupFactors)
  bits <- matrix(0, nrow=length(groupStrings), ncol=length(colNames))
  bits[matrix(c(1:length(groupStrings),
                unclass(groupFactors)), ncol=2)] <- 1
  setNames(as.data.frame(bits), colNames)
}

indicator2 <- function(groupStrings) {
  colNames <- unique(groupStrings)
  bits <- outer(groupStrings, colNames, "==")
  setNames(as.data.frame(bits * 1), colNames)
}

Used as follows

d <- data.frame(cond=c("a", "a", "b"))
d <- cbind(d, indicator2(as.character(d$cond)))
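
To make the two tricks easier to follow, here is a stripped-down sketch of each on a three-element vector. The variable names (`groups`, `idx`, `bits`, `bits2`) are only for illustration. The first part fills a zero matrix through a two-column matrix of (row, column) pairs, which is the idea behind indicator1; the second compares every element against every distinct value with outer, as in indicator2.

```r
groups <- c("b", "a", "b")

## Matrix indexing: each row of idx is a (row, column) position to set to 1
f <- factor(groups)                              # levels are sorted: "a", "b"
bits <- matrix(0, nrow = length(groups), ncol = nlevels(f))
idx <- cbind(seq_along(groups), as.integer(f))   # (1,2), (2,1), (3,2)
bits[idx] <- 1
colnames(bits) <- levels(f)

## outer: logical comparison of every element with every distinct value
vals <- unique(groups)                           # order of first appearance: "b", "a"
bits2 <- outer(groups, vals, "==") * 1           # * 1 turns TRUE/FALSE into 1/0
colnames(bits2) <- vals
```

One difference to keep in mind: the factor-based version orders the columns by sorted level, while the unique-based version orders them by first appearance, so the same data can come out with the columns in a different order.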
David F
A: 

Again, a great example of the greatness of open-source! Thanks so much for your help. The initial solution seemed to work best for me. In case someone else might be interested, here is how I implemented this with my (very large) dataset:

 # Load needed libraries if not already attached  
 if (!("package:moments" %in% search())) library(moments)  

 # Initialize dataframes. DEFINE THE workspace SUBSET TO ANALYZE HERE  
 df<-stroke  

 # Make any necessary modifications to the df  
 df$TrDif <- df$TrBt-df$TrAt  

 # 0) Set up indicator variables (iv) from the factor you choose.  
 op <- options(contrasts = c("contr.sum","contr.poly"))  
 dat<-subset(df,select=c("newcat"))  
 iv<-data.frame(model.matrix(~.-1,data=dat))  
 names(iv) <- levels(dat$newcat)  
 lbl<-levels(dat$newcat) # need this for plot functions below  

 # Select task variables with n > 1150 to be regressed (THIS CAN PROBABLY BE DONE MORE ELEGANTLY).  
 taskarr<-subset(df,   select=c("B20","B40","FW","Anim","TrAt","TrBt","TrBerr","TrDif","Snod15","tt","GEMS","Clock3","orient","Wlenc","wlfr","wlcr","wlrec","Snod15Rec","GEMSfr"))  

 ## 1) evaluate covariance matrix and extract sub-matrices  
 ## Caution: Covariance samples differ due to missing values.  
 sig <- cov(cbind(iv,taskarr),use="pairwise.complete.obs")  
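
For anyone who wants to try the sub-matrix step without the original data, here is a rough sketch on made-up data. Everything in it (the toy `cond` factor, the task columns, and the names `sxx`, `syy`, `sxy`) is an assumption for illustration, not part of the actual analysis:

```r
set.seed(1)  # toy data only, not the stroke dataset
cond <- factor(rep(c("x", "y", "z"), each = 10))
iv <- as.data.frame(model.matrix(~ cond - 1))
names(iv) <- levels(cond)
taskarr <- data.frame(task1 = rnorm(30), task2 = rnorm(30))
taskarr$task1[3] <- NA  # one missing value, to exercise pairwise deletion

## Joint covariance matrix of indicators and tasks
sig <- cov(cbind(iv, taskarr), use = "pairwise.complete.obs")

## Split it into blocks by column position
p <- ncol(iv)
q <- ncol(taskarr)
sxx <- sig[1:p, 1:p]                          # indicators with indicators
syy <- sig[(p + 1):(p + q), (p + 1):(p + q)]  # tasks with tasks
sxy <- sig[1:p, (p + 1):(p + q)]              # indicators with tasks
```

One thing worth checking before going further: the indicator columns sum to one in every row, so the indicator block (`sxx` here) is rank-deficient, and one indicator column may need to be dropped before any step that inverts it.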
Jon Erik Ween