Hi Everyone,

I am just starting to get beyond the basics in R and have come to a point where I need some help. I want to restructure some data. Here is what a sample dataframe may look like:

ID  Sex Res Contact
1   M   MA  ABR
1   M   MA  CON
1   M   MA  WWF
2   F   FL  WIT
2   F   FL  CON
3   X   GA  XYZ

I want the data to look like:

ID  Sex Res ABR CON WWF WIT XYZ
1   M   MA  1   1   1   0   0
2   F   FL  0   1   0   1   0
3   X   GA  0   0   0   0   1

What are my options? How would I do this in R? R is so powerful, I figure this is probably a breeze!

In short, I am looking to keep the values of the Contact column and use them as column names in the restructured data frame. I want to hold a variable set of columns constant (in the example above, I held ID, Sex, and Res constant). Also, is it possible to control the values in the restructured data? Sometimes I may want to keep the data as binary, and sometimes I may want the value to be the count of times each contact value occurs for each ID.

Any help will be very much appreciated,

Brock

+9  A: 

The reshape package is what you want. Documentation here: http://had.co.nz/reshape/. Not to toot my own horn, but I've also written up some notes on reshape's use here: http://www.ling.upenn.edu/~joseff/rstudy/summer2010_reshape.html

For your purpose, this code should work:

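library(reshape)
# cast() needs a measurement column; add a dummy one named "value"
data$value <- 1
# count the rows in each ID/Sex/Res x Contact cell; empty cells become 0
cast(data, ID + Sex + Res ~ Contact, fun.aggregate = length)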
JoFrhwld
I've been using R for a long time and I never knew that you could do data$value <- 1 instead of data$value <- rep(1, nrow(data)). Can't believe I never tried that -- did that always work?
Daniel Dickison
@Daniel You should also try `data$variable <- 1; data$variable[data$Group == "A"] <- 2`
JoFrhwld
I knew Hadley's reshape package was probably the answer, but I have had a hard time getting my head around it. In your code, what does the data$value with the assignment to 1 do?
Btibert3
For `reshape`, there must be some measurement value it refers to. The formula defines the "matrix", then it looks for all values which would fit in each cell. By default, it looks for values in a column called `value`, but you can override this. I had to add a `value` column, because after defining the matrix, there was no column corresponding to the measurement value, so I created a dummy one. If the number of values per cell <= 1, `cast` fills the cell with the value. Otherwise, it aggregates the values according to some function, `length` by default.
JoFrhwld
I defined an aggregation function because otherwise cells with no observations would have been filled with `NA`. You wanted `0` instead, so I aggregated the `value` values with `length`.
JoFrhwld
Excellent thanks!
Btibert3
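For anyone who wants to see the count-versus-binary choice from the question without installing a package, here is a minimal base R sketch. It swaps in `table()` for reshape's `cast()`, so it is not the code from the answer above, only the same idea of counting rows per cell. The names `d`, `counts`, `binary`, `ids`, and `wide` are my own, and the data frame is retyped from the sample in the question.

```r
# Sample data retyped from the question (the name d is an assumption)
d <- data.frame(
  ID      = c(1, 1, 1, 2, 2, 3),
  Sex     = c("M", "M", "M", "F", "F", "X"),
  Res     = c("MA", "MA", "MA", "FL", "FL", "GA"),
  Contact = c("ABR", "CON", "WWF", "WIT", "CON", "XYZ"),
  stringsAsFactors = FALSE
)

# Number of rows in each ID x Contact cell; cells with no rows are 0
counts <- unclass(table(d$ID, d$Contact))

# Binary version: 1 if the contact occurs at all for that ID, else 0
binary <- (counts > 0) * 1L

# Reattach the columns held constant (one row per ID)
ids  <- unique(d[, c("ID", "Sex", "Res")])
wide <- data.frame(ids, counts[as.character(ids$ID), , drop = FALSE],
                   row.names = NULL)
wide
```

With the sample data every contact occurs at most once per ID, so `counts` and `binary` hold the same values; they differ only when an ID has repeated contacts. Note that `table()` orders the new columns alphabetically rather than in order of first appearance.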
+1  A: 

model.matrix works great (this was asked recently, and gappy had this good answer):

> model.matrix(~ factor(d$Contact) -1)
  factor(d$Contact)ABR factor(d$Contact)CON factor(d$Contact)WIT factor(d$Contact)WWF factor(d$Contact)XYZ
1                    1                    0                    0                    0                    0
2                    0                    1                    0                    0                    0
3                    0                    0                    0                    1                    0
4                    0                    0                    1                    0                    0
5                    0                    1                    0                    0                    0
6                    0                    0                    0                    0                    1
attr(,"assign")
[1] 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$`factor(d$Contact)`
[1] "contr.treatment"
Vince
Eark! Misunderstood the question. You could use my answer and then use `tapply`, but JoFrhwld's answer is easier.
Vince
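A rough sketch of how the `model.matrix` route could be finished off, for completeness. The comment above suggests `tapply`; this sketch uses base R's `rowsum()` instead, which sums the indicator rows within each ID in one call. The names `d`, `mm`, `per_id`, `ids`, and `result` are hypothetical, and the data frame is retyped from the question.

```r
# Sample data retyped from the question (the name d is an assumption)
d <- data.frame(
  ID      = c(1, 1, 1, 2, 2, 3),
  Sex     = c("M", "M", "M", "F", "F", "X"),
  Res     = c("MA", "MA", "MA", "FL", "FL", "GA"),
  Contact = c("ABR", "CON", "WWF", "WIT", "CON", "XYZ"),
  stringsAsFactors = FALSE
)

# One indicator column per Contact level, one row per original row
mm <- model.matrix(~ factor(Contact) - 1, data = d)
colnames(mm) <- levels(factor(d$Contact))  # drop the "factor(Contact)" prefix

# Sum the indicator rows within each ID: gives counts per ID and Contact
per_id <- rowsum(mm, group = d$ID)

# Reattach the columns held constant (one row per ID)
ids    <- unique(d[, c("ID", "Sex", "Res")])
result <- data.frame(ids, per_id[as.character(ids$ID), , drop = FALSE],
                     row.names = NULL)
result
```

If binary values are wanted instead of counts, something like `(per_id > 0) * 1` before the final step should do it.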