tags:

views:

552

answers:

2

I have a data frame where one particular column has a set of specific values (let's say, 1, 2, ..., 23). What I would like to do is to convert from this layout to the one, where the frame would have extra 23 (in this case) columns, each one representing one of the factor values. The data in these columns would be booleans indicating whether a particular row had a given factor value... To show a specific example:

Source frame:

ID       DATE         SECTOR
123      2008-01-01   1
456      2008-01-01   3
789      2008-01-02   5
... <more records with SECTOR values from 1 to 5>

Desired format:

ID       DATE         SECTOR.1   SECTOR.2   SECTOR.3   SECTOR.4   SECTOR.5
123      2008-01-01      T          F          F          F          F
456      2008-01-01      F          F          T          F          F
789      2008-01-02      F          F          F          F          T

I have no problem doing it in a loop but I hoped there would be a better way. So far reshape() didn't yield the desired result. Help would be much appreciated.

+3  A: 

I would try to bind another column called "value" and set value = TRUE.

df <- data.frame(cbind(1:10, 2:11, 1:3))
colnames(df) <- c("ID","DATE","SECTOR")
df <- data.frame(df, value=TRUE)

Then do a reshape:

reshape(df, idvar=c("ID","DATE"), timevar="SECTOR", direction="wide")

The problem with using the reshape function is that the default for missing values is NA (in which case you will have to iterate and replace them with FALSE).

Otherwise you can use cast out of the reshape package (see this question for an example), and set the default to FALSE.

df.wide <- cast(df, ID + DATE ~ SECTOR, fill=FALSE)
> df.wide 
   ID DATE     1     2     3
1   1    2  TRUE FALSE FALSE
2   2    3 FALSE  TRUE FALSE
3   3    4 FALSE FALSE  TRUE
4   4    5  TRUE FALSE FALSE
5   5    6 FALSE  TRUE FALSE
6   6    7 FALSE FALSE  TRUE
7   7    8  TRUE FALSE FALSE
8   8    9 FALSE  TRUE FALSE
9   9   10 FALSE FALSE  TRUE
10 10   11  TRUE FALSE FALSE
Shane
Thank you. I should've thought about creating a value column. Interestingly, the value column/reshape approach takes 1.4 seconds on 9,500 rows with 26 factor levels, whereas using iterative approach (over levels) takes only 0.6 seconds.
Alexander L. Belikoff
Don't be tricked by these functions: the `reshape` function itself does iteration if you look at it. But it does much else besides that, which will add to the overall time. Things like `reshape` are not intended to perform better; they are there just to make data manipulation easier.
Shane
+1  A: 

Here's another approach using xtabs which may or may not be faster (if someone would try and let me know):

df <- data.frame(cbind(1:12, 2:13, 1:3))
colnames(df) <- c("ID","DATE","SECTOR")
foo <- xtabs(~ paste(ID, DATE) + SECTOR, df)
cbind(t(matrix(as.numeric(unlist(strsplit(rownames(foo), " "))), nrow=2)), foo)
Jonathan Chang