In R, what is the most efficient/idiomatic way to count the number of TRUE values in a logical vector? I can think of two ways:

> z<-sample(c(TRUE,FALSE),1000,rep=TRUE)
> sum(z)
[1] 498
> table(z)["TRUE"]
TRUE 
 498 

Which do you prefer? Is there anything even better?

+1  A: 

Another way is

> length(z[z==TRUE])
[1] 498

While sum(z) is nice and short, for me length(z[z==TRUE]) is more self-explanatory. Though I think that for a simple task like this it does not really make a difference...

If it is a large vector, you should probably go with the fastest solution, which is sum(z). length(z[z==TRUE]) is about 10x slower and table(z)["TRUE"] is about 200x slower than sum(z).

Summing up, sum(z) is the fastest to type and to execute.

f3lix
+6  A: 

There are some problems when the logical vector contains NA values.
For example:

z <- c(TRUE, FALSE, NA)
sum(z) # gives you NA
table(z)["TRUE"] # gives you 1
length(z[z==TRUE]) # f3lix's answer, gives you 2 (because indexing with NA returns an NA element)

So I think the safe choice is sum(z, na.rm=TRUE) (which gives 1). I think the table solution is less efficient (look at the code of the table function).
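
As a minimal sketch of the point above, here are three NA-safe ways to count TRUE values on the same three-element vector. The `%in%` variant is an addition for illustration and is not part of the original answer.

```r
z <- c(TRUE, FALSE, NA)

# na.rm = TRUE drops the NA before summing
n_sum <- sum(z, na.rm = TRUE)

# which() returns only the positions of TRUE, so the NA is skipped
n_which <- length(which(z))

# %in% never returns NA, so the comparison result is NA-free
n_in <- sum(z %in% TRUE)

c(n_sum, n_which, n_in)  # all three counts are 1
```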

Marek
+7  A: 

Another option which hasn't been mentioned is to use which:

length(which(z))

Just to provide some context on the "which is faster" question: it's always easiest to test it yourself. I made the vector much larger for comparison:

z <- sample(c(TRUE,FALSE),1000000,rep=TRUE)
system.time(sum(z))
   user  system elapsed 
   0.03    0.00    0.03
system.time(length(z[z==TRUE]))
   user  system elapsed 
   0.75    0.07    0.83 
system.time(length(which(z)))
   user  system elapsed 
   1.34    0.28    1.64 
system.time(table(z)["TRUE"])
   user  system elapsed 
  10.62    0.52   11.19 

So clearly using sum is the best approach in this case. You may also want to check for NA values as Marek suggested.
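
A single system.time call can be noisy, so one possible refinement is to repeat each expression several times and compare the totals. The sketch below assumes a hypothetical helper named `time_n` and an arbitrary seed; the numbers it prints will differ from machine to machine, so none are shown.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
z <- sample(c(TRUE, FALSE), 1e6, replace = TRUE)

# hypothetical helper: run a function n times and return the total elapsed seconds
time_n <- function(expr_fun, n = 20) {
  system.time(for (i in seq_len(n)) expr_fun())[["elapsed"]]
}

timings <- c(
  sum    = time_n(function() sum(z)),
  subset = time_n(function() length(z[z == TRUE])),
  which  = time_n(function() length(which(z)))
)
timings
```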

Just to add a note regarding NA values and the which function:

> which(c(T, F, NA, NULL, T, F))
[1] 1 4
> which(!c(T, F, NA, NULL, T, F))
[1] 2 5

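Note that which returns only the positions of TRUE elements, so NA is skipped. NULL never reaches which at all, because c() drops it when building the vector (which is why the second TRUE sits at position 4).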

Shane
BTW, there was a nice trick with timing in Dirk's answer: http://stackoverflow.com/questions/1748590/revolution-for-r/1748932#1748932
Marek
+1  A: 

which is a good alternative, especially when you operate on matrices (check ?which and notice the arr.ind argument). But I suggest that you stick with sum, because its na.rm argument can handle NAs in a logical vector. For instance:

# create dummy variable
set.seed(100)
x <- round(runif(100, 0, 1))
x <- x == 1
# create NA's
x[seq(1, length(x), 7)] <- NA

If you type sum(x) you'll get NA as a result, but if you pass na.rm = TRUE to the sum function, you'll get the result that you want.

> sum(x)
[1] NA
> sum(x, na.rm=TRUE)
[1] 43
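
The arr.ind argument mentioned above can be illustrated with a small logical matrix. This is only a sketch on made-up data: it shows that which reports the row and column of each TRUE cell and that the count agrees with sum(..., na.rm = TRUE).

```r
# small logical matrix with one NA (made-up data, filled column by column)
m <- matrix(c(TRUE, FALSE, NA, TRUE, FALSE, TRUE), nrow = 2)

# row/column positions of the TRUE cells; the NA cell is skipped
pos <- which(m, arr.ind = TRUE)
pos

nrow(pos)             # number of TRUE cells
sum(m, na.rm = TRUE)  # same count via sum
```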

Is your question strictly theoretical, or do you have some practical problem concerning logical vectors?

aL3xa
I was trying to grade a quiz. Doing something like sum(youranswer==rightanswer) within an apply.
Jyotirmoy Bhattacharya
My reply is just too long, so I posted a new answer, since it differs from the previous one.
aL3xa
A: 

I did something similar a few weeks ago. Here's a possible solution. It's written from scratch, so it's kind of a beta release or something like that. I'll try to improve it by removing loops from the code...

The main idea is to write a function that takes 2 (or 3) arguments. The first is a data.frame holding the data gathered from the questionnaire, and the second is a numeric vector with the correct answers (this only applies to single-choice questionnaires). Optionally, a third argument selects whether the function returns a numeric vector with the final scores or a data.frame with the score embedded.

fscore <- function(x, sol, output = 'numeric') {
    if (ncol(x) != length(sol)) {
        stop('Number of items differs from length of correct answers!')
    } else {
        inc <- matrix(ncol=ncol(x), nrow=nrow(x))
        for (i in 1:ncol(x)) {
            inc[,i] <- x[,i] == sol[i]
        }
        if (output == 'numeric') {
            res <- rowSums(inc)
        } else if (output == 'data.frame') {
            res <- data.frame(x, result = rowSums(inc))
        } else {
            stop('Type not supported!')
        }
    }
    return(res)
}

I'll try to do this in a more elegant manner with some *ply function. Notice that I didn't include an na.rm argument... I will do that.

# create dummy data frame - values from 1 to 5
set.seed(100)
d <- as.data.frame(matrix(round(runif(200,1,5)), 10))
# create solution vector
sol <- round(runif(20, 1, 5))

Now apply a function:

> fscore(d, sol)
 [1] 6 4 2 4 4 3 3 6 2 6

If you pass output = 'data.frame', it will return the data.frame with a result column appended. I'll try to fix this one... Hope it helps!
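
For what it's worth, here is a rough sketch of how the loop could be dropped and an na.rm argument added. The name `fscore_vec`, the na.rm parameter, and the tiny data set are all hypothetical, and sweep is just one way to compare each column with its solution; treat it as an illustration rather than a finished replacement for fscore.

```r
# hypothetical loop-free variant of fscore(); name and na.rm argument are illustrative
fscore_vec <- function(x, sol, na.rm = FALSE) {
  if (ncol(x) != length(sol)) {
    stop("Number of items differs from length of correct answers!")
  }
  # compare every column j of x with sol[j] in a single call
  inc <- sweep(as.matrix(x), 2, sol, FUN = "==")
  rowSums(inc, na.rm = na.rm)
}

# tiny made-up questionnaire: 3 respondents, 4 items, one missing answer
d_small <- data.frame(q1 = c(1, 2, 1), q2 = c(3, 3, NA),
                      q3 = c(5, 4, 5), q4 = c(2, 2, 2))
sol_small <- c(1, 3, 5, 2)

fscore_vec(d_small, sol_small)                # score of respondent 3 is NA
fscore_vec(d_small, sol_small, na.rm = TRUE)  # missing answer counts as wrong
```

With na.rm = FALSE a single missing answer makes the whole score NA, which may or may not be what you want when grading.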

aL3xa
One-liner: `rowSums(t(t(d)==sol), na.rm=TRUE)`. R recycles the vector for the comparison. If your `d` were a matrix with cases in columns, then it simplifies to `colSums(d==sol, na.rm=TRUE)`.
Marek