tags:

views:

298

answers:

1

I do find R to be a very frustrating language. It seems to excel on data-dependent bugs. Consider the following code. When you don't explicitly test for NA in your condition, that code will fail at some later date then your data changes.

>   # A toy example
>   a <- as.data.frame(cbind(col1=c(1,2,3,4),col2=c(2,NA,2,3),col3=c(1,2,3,4),col4=c(4,3,2,1)))
>   a
  col1 col2 col3 col4
1    1    2    1    4
2    2   NA    2    3
3    3    2    3    2
4    4    3    4    1
>   
>   # Bummer, there's an NA in my condition
>   a$col2==2
[1]  TRUE    NA  TRUE FALSE
> 
>   # Why is this a good thing to do?
>   # It NA'd the whole row, and kept it
>   a[a$col2==2,]
   col1 col2 col3 col4
1     1    2    1    4
NA   NA   NA   NA   NA
3     3    2    3    2
>   
>   # Yes, this is the right way to do it
>   a[!is.na(a$col2) & a$col2==2,]
  col1 col2 col3 col4
1    1    2    1    4
3    3    2    3    2
>     
>   # Subset seems designed to avoid this problem
>   subset(a, col2 == 2)
  col1 col2 col3 col4
1    1    2    1    4
3    3    2    3    2

Can someone explain why the behavior you get without the is.na check would ever by good or useful?

+6  A: 

I definitely agree that this isn't intuitive (I made that point before on SO). In defense of R, I think that knowing when you have a missing value is useful (i.e. this is not a bug). The == operator is explicitly designed to notify the user of NA or NaN values. See ?"==" for more information. It states:

Missing values ('NA') and 'NaN' values are regarded as non-comparable even to themselves, so comparisons involving them will always result in 'NA'.

In other words, a missing value isn't comparable using a binary operator (because it's unknown).

Beyond is.na(), you could also do:

which(a$col2==2) # tests explicitly for TRUE

Or

a$col2 %in% 2 # only checks for 2

%in% is defined as using the match() function:

'"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0'

This is also covered in "The R Inferno".

Checking for NA values in your data is crucial in R, because many important operators don't handle it the way you expect. Beyond ==, this is also true for things like &, |, <, sum(), and so on. I am always thinking "what would happen if there was an NA here" when I'm writing R code. Requiring an R user to be careful with missing values is "by design".

Shane

related questions