Is there a certain R-gotcha that had you really surprised one day? I think we'd all gain from sharing these.
Here's mine: in list indexing, my.list[[1]] is not my.list[1]. Learned this in the early days of R.
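A minimal sketch of the difference (the list contents here are arbitrary): single brackets return a sub-list, while double brackets extract the element itself.

```r
my.list <- list(a = 1:3, b = "x")

class(my.list[1])    # "list"    -- a one-element sub-list
class(my.list[[1]])  # "integer" -- the element itself

my.list[[1]] + 1     # works: 2 3 4
# my.list[1] + 1     # error: non-numeric argument to binary operator
```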
accidentally listing source code of a function by forgetting to include empty parentheses: e.g. "ls" versus "ls()"
true & false don't cut it as pre-defined constants, like in Matlab, C++ or Java; you must use TRUE & FALSE
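A related trap worth knowing here: TRUE and FALSE are reserved words, but the shorthands T and F are ordinary variables that merely default to those values, so they can be silently reassigned.

```r
T <- FALSE      # perfectly legal: T is just a variable
isTRUE(T)       # FALSE -- T no longer means TRUE
# TRUE <- FALSE # error: TRUE is a reserved word and cannot be reassigned
rm(T)           # removes the local binding, restoring the default from base
isTRUE(T)       # TRUE again
```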
invisible return values: e.g. ".packages()" returns nothing, while "(.packages())" returns a character vector of package base names
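The same mechanism is available for your own functions via invisible(); a small sketch:

```r
f <- function() invisible(42)
f()        # prints nothing -- the value is returned invisibly
x <- f()
x          # 42 -- the value was returned, just not auto-printed
(f())      # 42 -- surrounding parentheses force printing
```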
Forgetting the drop=FALSE argument in subsetting matrices down to single dimension and thereby dropping the object class as well:
R> X <- matrix(1:4,2)
R> X
[,1] [,2]
[1,] 1 3
[2,] 2 4
R> class(X)
[1] "matrix"
R> X[,1]
[1] 1 2
R> class(X[,1])
[1] "integer"
R> X[,1, drop=FALSE]
[,1]
[1,] 1
[2,] 2
R> class(X[,1, drop=FALSE])
[1] "matrix"
R>
Forgetting that strptime() and friends return POSIXlt (class POSIXt), for which length() is always nine -- converting to POSIXct helps:
R> length(strptime("2009-10-07 20:21:22", "%Y-%m-%d %H:%M:%S"))
[1] 9
R> length(as.POSIXct(strptime("2009-10-07 20:21:22", "%Y-%m-%d %H:%M:%S")))
[1] 1
R>
The automatic creation of factors when you load data. You unthinkingly treat a column in a data frame as characters, and this works well until you do something like trying to change a value to one that isn't a level. This will generate a warning but leave your data frame with NA's in it ...
When something goes unexpectedly wrong in your R script, check that factors aren't to blame.
> a<-data.frame(c(1,2,3,4),c(4,3,2,1))
> a<-a[-3,]
> a
c.1..2..3..4. c.4..3..2..1.
1 1 4
2 2 3
4 4 1
> a[4,1]<-1
> a
Error in data.frame(c.1..2..3..4. = c("1", "2", "4", "1"), c.4..3..2..1. = c(" 4", :
duplicate row.names: 4
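The factor-assignment problem described above can be reproduced directly. (Note: since R 4.0.0, data.frame() defaults to stringsAsFactors = FALSE, so it is set explicitly here for illustration.)

```r
d <- data.frame(x = c("a", "b"), stringsAsFactors = TRUE)
d$x[1] <- "c"   # "c" is not an existing level:
                # warning "invalid factor level, NA generated"
d$x
# [1] <NA> b
# Levels: a b
```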
Always test what happens when you have an NA!
One thing that I always need to pay careful attention to (after many painful experiences) is NA values. R functions are easy to use, but no amount of programming will overcome issues with your data.
For instance, any vector operation involving an NA evaluates to NA. This is "surprising" on the face of it:
> x <- c(1,1,2,NA)
> 1 + NA
[1] NA
> sum(x)
[1] NA
> mean(x)
[1] NA
This gets extrapolated out into other higher-level functions.
In other words, missing values frequently have as much importance as measured values by default. Many functions have na.rm=TRUE/FALSE defaults; it's worth spending some time deciding how to interpret these default settings.
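A quick sketch of what na.rm changes, using the vector from above:

```r
x <- c(1, 1, 2, NA)
sum(x)                # NA -- the missing value propagates
sum(x, na.rm = TRUE)  # 4  -- the NA is dropped first
mean(x, na.rm = TRUE) # 1.333333 -- note the divisor is now 3, not 4
```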
Edit 1: Marek makes a great point. NA values can also cause confusing behavior with logical operators and in indexes. For instance:
> TRUE && NA
[1] NA
> FALSE && NA
[1] FALSE
> TRUE || NA
[1] TRUE
> FALSE || NA
[1] NA
This is also true when you're trying to create a conditional expression (for an if statement):
> any(c(TRUE, NA))
[1] TRUE
> any(c(FALSE, NA))
[1] NA
> all(c(TRUE, NA))
[1] NA
When these NA values end up as your vector indexes, many unexpected things can follow. This is all good behavior for R, because it means that you have to be careful with missing values. But it can cause major headaches at the beginning.
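A short sketch of how an NA slips from a comparison into an index:

```r
x <- c(10, NA, 30)
x > 15            # FALSE NA TRUE -- the comparison itself contains NA
x[x > 15]         # NA 30 -- an NA in a logical index yields an NA element
x[which(x > 15)]  # 30 -- which() drops NAs, which is often what was intended
```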
Coming from compiled languages and Matlab, I've occasionally gotten confused about a fundamental aspect of functions in functional languages: they have to be defined before they're used! It's not enough for them merely to be parsed by the R interpreter. This mostly rears its head when you use nested functions.
In Matlab you can do:
function f1()
v1 = 1;
v2 = f2();
fprintf('2 == %d\n', v2);
function r1 = f2()
r1 = v1 + 1 % nested function scope
end
end
If you try to do the same thing in R, you have to put the nested function first, or you get an error! Just because you've written the function definition, it's not in scope until the definition has actually been evaluated and assigned to a variable. On the other hand, the function body can refer to a variable that has not been defined yet.
f1 <- function() {
f2 <- function() {
v1 + 1
}
v1 <- 1
v2 <- f2()
print(sprintf("2 == %d", v2))
}
In R, even the unary + and - operators are function calls. For instance, the number 3.14 is a numerical constant, but the expressions +3.14 and -3.14 are calls to the functions "+" and "-":
> class(quote(3.14))
[1] "numeric"
> class(quote(+3.14))
[1] "call"
> class(quote(-3.14))
[1] "call"
See Section 13.2 in John Chambers' book Software for Data Analysis: Programming with R.
[Hadley pointed this out in a comment.]
When using a sequence as an index for iteration, it's better to use seq_along() rather than something like 1:length(x).
Here I create a vector and both approaches return the same thing:
> x <- 1:10
> 1:length(x)
[1] 1 2 3 4 5 6 7 8 9 10
> seq_along(x)
[1] 1 2 3 4 5 6 7 8 9 10
Now make the vector NULL:
> x <- NULL
> seq_along(x) # returns an empty integer; good behavior
integer(0)
> 1:length(x) # length(NULL) is 0, so this counts down from 1 to 0; this is bad
[1] 1 0
This can cause some confusion in a loop:
> for(i in 1:length(x)) print(i)
[1] 1
[1] 0
> for(i in seq_along(x)) print(i)
>
Mine from today: qnorm() takes Probabilities and pnorm() takes Quantiles.
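An easy way to keep them straight: pnorm() maps quantiles to probabilities, and qnorm() is its inverse.

```r
pnorm(1.96)        # quantile -> probability: about 0.975
qnorm(0.975)       # probability -> quantile: about 1.96
qnorm(pnorm(0.5))  # the two are inverses: returns 0.5
```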
First, let me say that I understand the fundamental problems of representing numbers in a binary system. Nevertheless, one problem that I think could easily be improved is how numbers are displayed when their fractional part falls below R's default printing precision.
x <- 10.2 * 100
x
1020
as.integer(x)
1019
I don't mind the result being displayed as an integer when it really can be represented as an integer. For example, if the value really were 1020 then printing that for x would be fine. But printing something as simple as 1020.0 in this case would have made it more obvious that the value was not an integer and not representable as one. By default, R should give some indication when a value has an extremely small fractional component that isn't being displayed.
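Printing with more digits makes the hidden fractional part visible:

```r
x <- 10.2 * 100
x                      # 1020 -- the default 7 significant digits hide the error
print(x, digits = 16)  # shows the stored value, slightly below 1020
x == 1020              # FALSE
as.integer(x)          # 1019 -- as.integer() truncates toward zero
```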
It can be annoying to have to allow for combinations of NA, NaN and Inf. They behave differently, and tests for one won't necessarily work for the others:
> x <- c(NA,NaN,Inf)
> is.na(x)
[1] TRUE TRUE FALSE
> is.nan(x)
[1] FALSE TRUE FALSE
> is.infinite(x)
[1] FALSE FALSE TRUE
However, the safest way to test for any of these trouble-makers at once is:
> is.finite(x)
[1] FALSE FALSE FALSE