ansaurus

Question

Answer 1

+1 A:

Try this:

R> df <- data.frame(VAR = c(0,1,0,1,1))
R> df[ -which(df[,"VAR"]==1), , drop=FALSE]
  VAR
1   0
3   0
R>

We use which( booleanExpr ) to get the indices for which your condition holds, then negate these indices to exclude those rows, and lastly use drop=FALSE to prevent our one-column data.frame from collapsing into a vector.
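A minimal sketch of what each piece does, using the same `df` as above. The `VAR == 2` condition that matches nothing and the logical-index alternative are illustrations added here, not part of the original example.

```r
df <- data.frame(VAR = c(0, 1, 0, 1, 1))

# which() turns the logical condition into row indices
idx <- which(df[, "VAR"] == 1)
idx                                   # 2 4 5

# negative indices exclude those rows; drop = FALSE keeps a data.frame
kept <- df[-idx, , drop = FALSE]
class(kept)                           # "data.frame"

# without drop = FALSE a one-column result collapses to a plain vector
collapsed <- df[-idx, ]
class(collapsed)                      # "numeric"

# caveat: if nothing matches, which() returns integer(0), and
# -integer(0) selects zero rows instead of keeping all of them
none <- df[-which(df[, "VAR"] == 2), , drop = FALSE]
nrow(none)                            # 0

# a logical index avoids that edge case
all_rows <- df[!(df[, "VAR"] == 2), , drop = FALSE]
nrow(all_rows)                        # 5
```

If the condition may match no rows at all, the logical form is the safer of the two.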

Dirk Eddelbuettel 2010-08-31 22:39:50

Interesting but if I call `database$VAR` after this, I still am getting both 1's and 0's....

Brian 2010-08-31 22:46:30

You would have to assign the result back to `database` or assign it to a new variable.

Greg 2010-08-31 22:57:29

When I run `database$NEW <- df` I get the error: ``Error in `$<-.data.frame`(`*tmp*`, "NEW", value = list(VAR = c(0, 1, : replacement has 5 rows, data has 819``

Brian 2010-08-31 23:20:30

You would have to use `database <- database[-which(database[,"VAR"]==1), , drop=FALSE]`

Greg 2010-08-31 23:26:33

Thank you! That works!

Brian 2010-08-31 23:28:57

@Dirk Eddelbuettel - is there a reason why `subset` would not be appropriate in this instance? It seems like it would be a little easier on the eyes for the less experienced R user. Using your sample data, this should work: `df.new <- subset(df, VAR == 0)`. Or am I missing something? Displaying the structure of the new object with `str(df.new)` shows that `df.new` is a data.frame.

Chase 2010-09-01 00:40:28

Sure, that would also work. But boolean-based indexing is closer to the basic operators.

Dirk Eddelbuettel 2010-09-01 01:13:15

There is one difference between `subset` and indexing: `subset` drops rows where the condition is `NA`, whereas `[` with a logical index returns a row of `NA`s for each of them. `which` omits `NA`s from the indices it returns (so `-which` leaves the `NA` rows in the `data.frame`).
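A small sketch of that difference. The data frame `d` with one `NA` is made up for illustration:

```r
d <- data.frame(VAR = c(0, NA, 1, 0))

# subset() drops rows where the condition is NA
by_subset <- subset(d, VAR == 0)
nrow(by_subset)                       # 2

# a logical index keeps an all-NA row for each NA in the condition
by_logical <- d[d$VAR == 0, , drop = FALSE]
nrow(by_logical)                      # 3

# which() ignores NA, so a positive which() matches subset() here ...
by_which <- d[which(d$VAR == 0), , drop = FALSE]
nrow(by_which)                        # 2

# ... but -which() only removes the matched rows, so the NA row stays
by_neg_which <- d[-which(d$VAR == 1), , drop = FALSE]
by_neg_which$VAR                      # 0 NA 0
```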

Marek 2010-09-01 14:31:06

Answer 2

+3 A:

TMTOWTDI

Using subset:

df.new <- subset(df, VAR == 0)

EDIT:

David's solution seems to be the fastest on my machine. Subset seems to be the slowest. I won't even pretend to understand what's going on under the hood that accounts for these differences:

> df <- data.frame(y=rep(c(1,0), times=1000000))
> 
> system.time(df[ -which(df[,"y"]==1), , drop=FALSE])
   user  system elapsed 
   0.16    0.05    0.23 
> system.time(df[which(df$y == 0), ])
   user  system elapsed 
   0.03    0.01    0.06 
> system.time(subset(df, y == 0))
   user  system elapsed 
   0.14    0.09    0.27

Chase 2010-09-01 04:04:15

Include `drop=FALSE` in the second timing. It will slow that method down.

Marek 2010-09-01 14:32:16

Answer 3

+2 A:

I'd upvote the answer using "subset" if I had the reputation for it :-) . You can also use a logical vector directly for subsetting -- no need for "which":

d <- data.frame(VAR = c(0,1,0,1,1))
d[d$VAR == 0, , drop=FALSE]

I'm surprised to find the logical version a little faster in at least one case. (I expected the "which" version might win due to R possibly preallocating the proper amount of storage for the result.)

> d <- data.frame(y=rep(c(1,0), times=1000000))
> system.time(d[which(d$y == 0), ])
   user  system elapsed 
  0.119   0.067   0.188 
> system.time(d[d$y == 0, ])
   user  system elapsed 
  0.049   0.024   0.074

David F 2010-09-01 06:40:47

+1 for timing the code

midtiby 2010-09-01 06:46:12

You should include `drop=FALSE` in the timing. And for me `which` is faster (with `drop` set to TRUE or FALSE).
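A sketch of a like-for-like comparison, assuming the same two-million-row data as above and passing `drop = FALSE` to both indexing calls. The variable names are made up for illustration, and no timings are shown because they depend on the machine and R version.

```r
d <- data.frame(y = rep(c(1, 0), times = 1000000))

# same three approaches, with drop = FALSE on both indexing calls
t_which   <- system.time(r_which   <- d[which(d$y == 0), , drop = FALSE])
t_logical <- system.time(r_logical <- d[d$y == 0, , drop = FALSE])
t_subset  <- system.time(r_subset  <- subset(d, y == 0))

# the three results should agree, since this data has no NAs
identical(r_which, r_logical)
isTRUE(all.equal(r_which, r_subset))

# elapsed seconds for each approach
c(which   = t_which[["elapsed"]],
  logical = t_logical[["elapsed"]],
  subset  = t_subset[["elapsed"]])
```

A single `system.time` call is noisy, so repeating each call several times gives a fairer picture.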

Marek 2010-09-01 14:14:05


Redefine Data Frame in R
