Is there a more "R-minded" way to dichotomise efficiently? Thanks.

y<-c(0,3,2,1,0,0,2,5,0,1,0,0);b<-vector()

for (k in 1:length(y)) {
    if (y[k] == 0) b[k] = 0
    else
        b[k] = 1
}
y;b
A: 

You have something that works. Are you worried about speed for some reason? Here's an alternative:

y<-c(0,3,2,1,0,0,2,5,0,1,0,0)

decider = function( x ) {
   if ( x == 0 ) {
      return(0)
   }

   return(1)
}

b = sapply( y, decider )
James Thompson
Out of curiosity: is that any faster than the original version?
Shane
+5  A: 

Try this:

b <- rep(0, length(y))
b[y != 0] <- 1

This is efficient because rep() and the logical-index assignment are both vectorized, so no explicit loop over y is needed.
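To see what the two lines above are doing, it may help to look at the intermediate logical vector. This is only a minimal sketch using the y from the question; the name `nonzero` is introduced here purely for illustration.

```r
y <- c(0, 3, 2, 1, 0, 0, 2, 5, 0, 1, 0, 0)

# Comparing a vector with a scalar yields one TRUE/FALSE per element.
nonzero <- y != 0   # 'nonzero' is a hypothetical helper name

# Start from all zeros, then overwrite only the positions flagged TRUE.
b <- rep(0, length(y))
b[nonzero] <- 1
b
```

The assignment touches every flagged position in one step, which is why no for loop is needed.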

Edit: Here's another approach:

b <- ifelse(y == 0, 0, 1)

The ifelse() function is also vectorized.

Shane
Using ifelse is much less efficient than your first suggestion. ifelse creates some vectors along the way which slows things down when y is very large.
Rob Hyndman
Thanks Rob. Good to know! Just wanted to show other approaches so that people can add them to their toolkit and stop all this unnecessary iteration. Your approach is very efficient.
Shane
"to show other approaches" - that's exactly why I liked Shane's answer to my previous question - if a person really wants to learn, he would normally be interested in various ways of doing one and the same thing - for the sake of learning and out of curiosity. It seems I can only accept one answer, though.
knot
I always time various approaches using Sys.time() (see my answer below). I can't even tell you how surprised I've been by results over the years.
Vince
+4  A: 
b <- as.numeric(y!=0)
Rob Hyndman
This is about the same speed as Shane's first suggestion, but somewhat neater. Both are much faster than any of the other suggestions given.
Rob Hyndman
You could also drop the `as.numeric`.
hadley
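As a rough illustration of the comment about dropping `as.numeric`: the comparison alone gives a logical vector, which R coerces to 0/1 whenever it is used in arithmetic. This is a sketch only, and the names `b_logical` and `b_numeric` are made up for the example.

```r
y <- c(0, 3, 2, 1, 0, 0, 2, 5, 0, 1, 0, 0)

b_logical <- y != 0                 # logical vector: TRUE where y is non-zero
b_numeric <- as.numeric(b_logical)  # same information stored as 0/1 doubles

class(b_logical)                    # "logical"
sum(b_logical)                      # TRUE counts as 1, so this counts non-zero entries
all(b_logical == b_numeric)         # the two agree element-wise after coercion
```

Whether the logical form is enough depends on what happens to b afterwards; code that checks the storage type explicitly would presumably still want the numeric version.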
+2  A: 

Use ifelse(). This is vectorized and (edit: somewhat) fast.

> y <- c(0,3,2,1,0,0,2,5,0,1,0,0)
> b <- ifelse(y == 0, 0, 1)
> b
 [1] 0 1 1 1 0 0 1 1 0 1 0 0

Edit 2: This approach is slower than the as.numeric(y!=0) approach.

> t <- Sys.time(); b <- as.numeric(y!=0); Sys.time() - t # Rob's approach
Time difference of 0.0002379417 secs
> t <- Sys.time(); b <- ifelse(y==0, 0, 1); Sys.time() - t # Shane's 2nd and my approach
Time difference of 0.000428915 secs
> t <- Sys.time(); b = sapply( y, decider ); Sys.time() - t # James's approach
Time difference of 0.0004429817 secs

But to some, ifelse may be trivially more readable than the as.numeric approach.

Note the OP's version took 0.0004558563 secs to run.

Vince
You need to time such things on much longer vectors to get good estimates. Enlarge the vector until the slowest method takes about 10 secs.
Thierry
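One way the suggestion about longer vectors might be tried is sketched below with base R's system.time(). The vector length `n`, the seed, and the names `y_big`, `b1`, `b2` and `b3` are assumptions chosen for illustration, and `n` is kept far smaller than what a 10-second run would need. No timings are quoted here because they vary by machine.

```r
set.seed(1)                                # arbitrary seed, for reproducibility only
n <- 1e6                                   # assumed size; enlarge for steadier timings
y_big <- sample(0:5, n, replace = TRUE)    # hypothetical test data

# as.numeric() on the logical comparison
system.time(b1 <- as.numeric(y_big != 0))

# ifelse()
system.time(b2 <- ifelse(y_big == 0, 0, 1))

# rep() followed by logical-index assignment
system.time({
  b3 <- rep(0, n)
  b3[y_big != 0] <- 1
})

# Check that all three approaches agree before comparing their speed.
identical(b1, b2) && identical(b1, b3)
```

Running each timing several times, or wrapping the call in replicate(), would give a better sense of the spread than a single measurement.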
@Thierry: I agree completely, I'm just lazy :-) I actually repeated these in the terminal multiple times to ensure they were somewhat consistent.
Vince
For this case, it's a bit of a waste of time - you probably spend a million times more time on the timings than actually running the code. There's no need to profile until you discover that performance is actually a problem.
hadley
How else do we see which is the most efficient solution (the original question) without timing?
Vince
+1  A: 

b<-(y!=0)+0

b
 [1] 0 1 1 1 0 0 1 1 0 1 0 0

DWin