Hello. I have a data.frame in R called p. Each element in the data.frame is either TRUE or FALSE. My variable p has, say, m rows and n columns. Every row contains exactly one TRUE element.

It also has column names, which are strings. What I would like to do is the following:

  1. Wherever there is a TRUE in p, I would like to replace it with the name of the corresponding column.
  2. I would then like to collapse the data.frame, which now contains FALSEs and column names, to a single vector, which will have m elements.
  3. I would like to do this in an R-thonic manner, so as to continue my enlightenment in R and contribute to a world without for-loops.

I can do step 1 using the following for loop:

for (i in seq_along(colnames(p))) {
    p[p[, i] == TRUE, i] <- colnames(p)[i]
}

but there's no beauty here and I have totally subscribed to this for-loops-in-R-are-probably-wrong mentality. Maybe wrong is too strong but they're certainly not great.

I don't really know how to do step 2. I kind of hoped that the sum of a string and FALSE would return the string but it doesn't. I kind of hoped I could use an OR operator of some kind but can't quite figure that out (Python responds to False or 'bob' with 'bob'). Hence, yet again, I appeal to you beautiful Rstats people for help!

+4  A: 

Here's some sample data:

df <- data.frame(a=c(FALSE, TRUE, FALSE), b=c(TRUE, FALSE, FALSE), c=c(FALSE, FALSE, TRUE))

You can use apply to do something like this:

names(df)[apply(df, 1, which)]
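
This relies on every row having exactly one TRUE, as stated in the question. Here is a minimal sketch of what happens otherwise, using a hypothetical `bad` data frame that is not from the question: `apply` can no longer simplify its result to a vector and returns a list instead, so a quick `rowSums` check up front may be worthwhile.

```r
df <- data.frame(a=c(FALSE, TRUE, FALSE), b=c(TRUE, FALSE, FALSE), c=c(FALSE, FALSE, TRUE))

# Exactly one TRUE per row: apply() simplifies to a vector of column positions
pos <- apply(df, 1, which)
res <- names(df)[pos]

# Hypothetical data that breaks the assumption:
# row 1 has two TRUEs, row 2 has none
bad <- data.frame(a=c(TRUE, FALSE), b=c(TRUE, FALSE))
bad_pos <- apply(bad, 1, which)   # results differ in length, so this is a list

# Guard: verify the one-TRUE-per-row assumption before indexing
ok     <- all(rowSums(df) == 1)
bad_ok <- all(rowSums(bad) == 1)
```

If the guard fails, the rows have to be handled case by case (for example, by deciding what an all-FALSE row should map to) before indexing into `names(df)`.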

Or without apply by using which directly:

idx <- which(as.matrix(df), arr.ind=T)
names(df)[idx[order(idx[,1]),"col"]]
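
The `order` call is there because `which` scans the matrix column by column, so the hits come back in column order rather than row order. A small sketch using the four-row data from the comments, where column `a` holds two TRUEs; the variable names `unsorted` and `sorted` are only for illustration:

```r
# Data from the comment thread: column a has two TRUEs
df <- data.frame(a=c(FALSE, TRUE, FALSE, TRUE), b=c(TRUE, FALSE, FALSE, FALSE), c=c(FALSE, FALSE, TRUE, FALSE))

# which(..., arr.ind=TRUE) returns one (row, col) pair per TRUE,
# listed column by column
idx <- which(as.matrix(df), arr.ind=TRUE)

# Without sorting, the names follow column order, not row order
unsorted <- names(df)[idx[, "col"]]

# Sorting the pairs by row index restores one name per row, in row order
sorted <- names(df)[idx[order(idx[, "row"]), "col"]]
```

Here `unsorted` comes out grouped by column, while `sorted` matches what `names(df)[apply(df, 1, which)]` gives for the same data.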
Shane
I'm getting old. You beat me by five minutes ;-)
Dirk Eddelbuettel
See the comment under Dirk's solution! The second approach doesn't give the same result as the first.
Mike Dewar
I corrected that.
Shane
+3  A: 

Use apply to sweep your index through, and use that index to access the column names:

> df <- data.frame(a=c(TRUE,FALSE,FALSE),b=c(FALSE,FALSE,TRUE),
+                  c=c(FALSE,TRUE,FALSE))
> df
      a     b     c
1  TRUE FALSE FALSE
2 FALSE FALSE  TRUE
3 FALSE  TRUE FALSE
> colnames(df)[apply(df, 1, which)]
[1] "a" "c" "b"
Dirk Eddelbuettel
Wow. Yet again we came up with roughly the exact same solution at the same time independently. Even the data!
Shane
You win by five minutes, but I get a higher technical score for using TRUE/FALSE instead of the very naughty and discouraged T/F :)
Dirk Eddelbuettel
then who should get the green tick? (thanks both, btw)
Mike Dewar
Clearly, I should get the green tick since I gave *two* solutions. :)
Shane
hmm. I think there's something wrong with your second solution though! It doesn't deal with multiple TRUEs in one column, whereas the shared solution deals with this fine. Compare the outputs using `df <- data.frame(a=c(FALSE, TRUE, FALSE, TRUE), b=c(TRUE, FALSE, FALSE, FALSE), c=c(FALSE, FALSE, TRUE, FALSE))` - which would you expect to be the appropriate behaviour?
Mike Dewar
Good catch. I just corrected that. Let us know which of those approaches *performs* better on your data set?
Shane
The second solution, where you form an index set first, takes less than half the time of the simpler apply. Don't know why, though! I'd have hoped that the simpler expression went faster! This decides the tick, though!
Mike Dewar
Great. `apply` is really nothing more than a loop (search Stack Overflow for other discussions on this...); it could actually be slower than your `for` loop. You might also consider giving us each a vote to reward Dirk's diligence in using the full TRUE/FALSE names.
Shane
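
For anyone who wants to repeat the timing comparison, here is a rough sketch on synthetic data. The sizes `m` and `n`, the seed, and the column names are made up for illustration, and the timings will depend on the machine and the shape of the data.

```r
set.seed(1)
m <- 2000; n <- 10                    # hypothetical sizes, not from the question
hot <- sample(n, m, replace=TRUE)     # position of the single TRUE in each row

# Build a logical data frame with exactly one TRUE per row
p <- as.data.frame(outer(hot, seq_len(n), "=="))
names(p) <- paste0("col", seq_len(n))

# Approach 1: apply() over rows
t_apply <- system.time(r1 <- names(p)[apply(p, 1, which)])

# Approach 2: which() on the whole matrix, then sort by row
t_which <- system.time({
    idx <- which(as.matrix(p), arr.ind=TRUE)
    r2 <- names(p)[idx[order(idx[, "row"]), "col"]]
})

identical(r1, r2)   # both approaches should agree
```

Compare `t_apply` and `t_which` on data of a realistic size before drawing conclusions; a single `system.time` run on a small input is noisy.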
both get a vote! Thanks as always for the awesome help!
Mike Dewar