Hello. I have a data frame

database$VAR

which has values of 0's and 1's.

How can I redefine the data frame so that the 1's are removed?

Thanks!

+1  A: 

Try this:

R> df <- data.frame(VAR = c(0,1,0,1,1))
R> df[ -which(df[,"VAR"]==1), , drop=FALSE]
  VAR
1   0
3   0
R> 

We use which( booleanExpr ) to get the indices for which your condition holds, then negate them with a leading minus sign to exclude those rows, and lastly use drop=FALSE to prevent our one-column data.frame from collapsing into a vector.
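A minimal sketch of what each piece does, using the same sample data as above. The object names (`kept`, `collapsed`, `no_ones`, `emptied`, `safe`) are made up for illustration. It also shows a pitfall worth knowing: when no row matches the condition, `which()` returns `integer(0)`, and negating that still selects nothing, so all rows are lost.

```r
# Sample data with the same shape as the example above
df <- data.frame(VAR = c(0, 1, 0, 1, 1))

# Negative indices from which() drop the matching rows
kept <- df[-which(df$VAR == 1), , drop = FALSE]

# Without drop = FALSE, a one-column data.frame collapses to a plain vector
collapsed <- df[-which(df$VAR == 1), ]

# Pitfall: with no matching rows, which() gives integer(0);
# -integer(0) is still integer(0), so the result has zero rows
no_ones <- data.frame(VAR = c(0, 0, 0))
emptied <- no_ones[-which(no_ones$VAR == 1), , drop = FALSE]

# A logical index does not have this problem
safe <- no_ones[no_ones$VAR != 1, , drop = FALSE]
```

If the data might contain no 1's at all, the logical form is probably the safer choice.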

Dirk Eddelbuettel
Interesting but if I call `database$VAR` after this, I still am getting both 1's and 0's....
Brian
You would have to assign the result back to `database` or assign it to a new variable.
Greg
When I run `data1base$NEW <- df` I get the error: Error in `$<-.data.frame`(`*tmp*`, "NEW", value = list(VAR = c(0, 1, : replacement has 5 rows, data has 819
Brian
You would have to use `database <- database[-which(database[,"VAR"]==1), , drop=FALSE]`
Greg
Thank you! That works!
Brian
@Dirk Eddelbuettel: is there a reason why `subset` would not be appropriate in this instance? It seems a little easier on the eyes for the less experienced R user. Using your sample data, this should work: `df.new <- subset(df, VAR == 0)`. Or am I missing something? `str(df.new)` shows that `df.new` is a data.frame.
Chase
Sure, that would also work. But boolean-based indexing is closer to the basic operators.
Dirk Eddelbuettel
There is one difference between `subset` and indexing. `subset` drops the rows where the condition is `NA`, whereas `[` with a logical index returns a row of `NA`s for each of them. `which` omits `NA`, so `-which(...)` leaves the `NA` rows in the `data.frame`.
Marek
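A small sketch of the `NA` behaviour described in this comment. The `id` column and the object names are hypothetical, added only so the rows can be told apart.

```r
# Hypothetical data: row 3 has a missing value in VAR
d <- data.frame(id = 1:4, VAR = c(0, 1, NA, 0))

# subset() treats an NA condition as FALSE: row 3 is dropped
by_subset <- subset(d, VAR == 0)

# A logical index with NA yields a row made entirely of NA
by_logical <- d[d$VAR == 0, , drop = FALSE]

# which() omits NA, so row 3 is dropped here as well
by_which <- d[which(d$VAR == 0), , drop = FALSE]

# -which() only removes the rows that are 1, so the real row 3 stays
by_negwhich <- d[-which(d$VAR == 1), , drop = FALSE]
```

Which behaviour is right depends on whether rows with a missing `VAR` should count as "not 1" or be discarded.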
+3  A: 

TMTOWTDI (there's more than one way to do it).

Using subset:

df.new <- subset(df, VAR == 0)

EDIT:

David's solution seems to be the fastest on my machine, and subset seems to be the slowest. I won't even pretend to understand what's going on under the hood that accounts for these differences:

> df <- data.frame(y=rep(c(1,0), times=1000000))
> 
> system.time(df[ -which(df[,"y"]==1), , drop=FALSE])
   user  system elapsed 
   0.16    0.05    0.23 
> system.time(df[which(df$y == 0), ])
   user  system elapsed 
   0.03    0.01    0.06 
> system.time(subset(df, y == 0))
   user  system elapsed 
   0.14    0.09    0.27 
Chase
Include `drop=FALSE` in the second timing. It will slow this method down.
Marek
+2  A: 

I'd upvote the answer using "subset" if I had the reputation for it :-). You can also use a logical vector directly for subsetting, with no need for "which":

d <- data.frame(VAR = c(0,1,0,1,1))
d[d$VAR == 0, , drop=FALSE]

I'm surprised to find the logical version a little faster in at least one case. (I expected the "which" version might win due to R possibly preallocating the proper amount of storage for the result.)

> d <- data.frame(y=rep(c(1,0), times=1000000))
> system.time(d[which(d$y == 0), ])
   user  system elapsed 
  0.119   0.067   0.188 
> system.time(d[d$y == 0, ])
   user  system elapsed 
  0.049   0.024   0.074 
David F
+1 for timing the code
midtiby
You should include `drop=FALSE` in the timing. And for me `which` is faster (with `drop` set to TRUE or FALSE).
Marek
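A sketch of a like-for-like comparison along the lines suggested in the comments: every variant uses `drop = FALSE`, so each returns a data.frame rather than a vector. The object names and the smaller data size are assumptions made for illustration. Timings vary by machine and R version, so none are shown.

```r
# Smaller than the data used above so the sketch runs quickly
d <- data.frame(y = rep(c(1, 0), times = 100000))

# Each call is timed and its result kept for comparison
t_negwhich <- system.time(r_negwhich <- d[-which(d$y == 1), , drop = FALSE])
t_which    <- system.time(r_which    <- d[which(d$y == 0), , drop = FALSE])
t_logical  <- system.time(r_logical  <- d[d$y == 0, , drop = FALSE])
t_subset   <- system.time(r_subset   <- subset(d, y == 0))

# Collect the elapsed times in one named vector for easy reading
elapsed <- c(negwhich = t_negwhich[["elapsed"]],
             which    = t_which[["elapsed"]],
             logical  = t_logical[["elapsed"]],
             subset   = t_subset[["elapsed"]])
print(elapsed)
```

A single `system.time` call is noisy, so repeating each measurement several times would give a fairer picture.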