views: 679
answers: 12
Is there a certain R-gotcha that had you really surprised one day? I think we'd all gain from sharing these.

Here's mine: in list indexing, `my.list[[1]]` is not `my.list[1]`. Learned this in the early days of R.
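
For example (a minimal illustration with a made-up list):

> my.list <- list(a = 1:3, b = "x")
> my.list[[1]]   # `[[` extracts the element itself
[1] 1 2 3
> my.list[1]     # `[` returns a sub-list containing the element
$a
[1] 1 2 3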

+2  A: 
  1. accidentally printing the source code of a function by forgetting to include the empty parentheses: e.g. `ls` versus `ls()`

  2. `true` and `false` don't cut it as pre-defined constants, as they do in Matlab, C++, Java, Python; you must use `TRUE` and `FALSE`

  3. invisible return values: e.g. `.packages()` prints nothing, while `(.packages())` prints a character vector of package base names (see the transcript below)
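
For the third point, a quick transcript (the exact packages printed will depend on your session):

> x <- .packages()   # the value is returned invisibly: nothing is printed
> (.packages())      # the extra parentheses force printing
[1] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"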

Stuart Andrews
R does predefine `T` as a shorthand though, so na.rm=T works as well as na.rm=TRUE. I always prefer the latter for readability.
Vince
Well, these aren't strictly equivalent. You can overwrite `T`, but `TRUE` is a reserved word. Try the following to confirm: `{ T <- FALSE; T }`. Very dangerous! So, Stuart is right: you need to be careful with your true/false values.
Shane
+7  A: 

Forgetting the drop=FALSE argument when subsetting a matrix down to a single dimension, thereby dropping the object class as well:

R> X <- matrix(1:4,2)
R> X
     [,1] [,2]
[1,]    1    3
[2,]    2    4
R> class(X)
[1] "matrix"
R> X[,1]
[1] 1 2
R> class(X[,1])
[1] "integer"
R> X[,1, drop=FALSE]
     [,1]
[1,]    1
[2,]    2
R> class(X[,1, drop=FALSE])
[1] "matrix"
R>
Dirk Eddelbuettel
+6  A: 

Forgetting that strptime() and friends return POSIXlt (a subclass of POSIXt) objects, for which length() is always nine -- converting to POSIXct helps:

R> length(strptime("2009-10-07 20:21:22", "%Y-%m-%d %H:%M:%S"))
[1] 9
R> length(as.POSIXct(strptime("2009-10-07 20:21:22", "%Y-%m-%d %H:%M:%S")))
[1] 1
R>
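
Where the nine comes from: a POSIXlt object is internally a list of nine components (recent R versions may show extra components such as zone and gmtoff):

R> names(unclass(strptime("2009-10-07 20:21:22", "%Y-%m-%d %H:%M:%S")))
[1] "sec"   "min"   "hour"  "mday"  "mon"   "year"  "wday"  "yday"  "isdst"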
Dirk Eddelbuettel
+6  A: 

The automatic creation of factors when you load data. You unthinkingly treat a column in a data frame as character, and this works well until you try something like changing a value to one that isn't an existing level. This will generate a warning but leave your data frame with NAs in it ...

When something goes unexpectedly wrong in your R script, check that factors aren't to blame.
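
A minimal sketch of the trap, assuming character columns become factors (the historical default; since R 4.0.0 stringsAsFactors defaults to FALSE):

> df <- data.frame(x = c("a", "b"), stringsAsFactors = TRUE)
> df$x[1] <- "c"   # "c" is not an existing level
Warning message:
In `[<-.factor`(`*tmp*`, 1, value = "c") :
  invalid factor level, NA generated
> df$x
[1] <NA> b
Levels: a b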

edward
This one had me confused once in my early R days.
Vince
Right -- but you can use `options("stringsAsFactors"=FALSE)` in your startup file(s) to change this.
Dirk Eddelbuettel
+5  A: 
> a<-data.frame(c(1,2,3,4),c(4,3,2,1))
> a<-a[-3,]
> a
  c.1..2..3..4. c.4..3..2..1.
1             1             4
2             2             3
4             4             1
> a[4,1]<-1
> a
Error in data.frame(c.1..2..3..4. = c("1", "2", "4", "1"), c.4..3..2..1. = c(" 4",  : 
  duplicate row.names: 4
Ian Fellows
Interesting. I've been using R and its S predecessors since 1988 and I'd never seen that before!
Rob Hyndman
Wow. That is very strange. Can you explain it?
Shane
So what is going on here is:

1. A four-row data.frame is created, so the rownames are c(1,2,3,4).
2. The third row is deleted, so the rownames are c(1,2,4).
3. A fourth row is added, and R automatically sets the row name equal to the index, i.e. 4, so the row names are c(1,2,4,4). This is illegal because row names should be unique.

I don't see why this type of behavior should be allowed by R. It seems to me that R should provide a unique row name.
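One workaround sketch (with illustrative column names x and y): reset the row names before growing the frame, so the auto-generated name is unique:

> a <- data.frame(x = c(1,2,3,4), y = c(4,3,2,1))
> a <- a[-3,]
> rownames(a) <- NULL   # renumber the remaining rows 1..3
> a[4,1] <- 1           # the new row now gets the unused name "4"
> a
  x  y
1 1  4
2 2  3
3 4  1
4 1 NA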
Ian Fellows
Very interesting. Two thoughts: (1) it might be clearer in the long run to edit your answer and add your explanation there and (2) have you considered emailing this into the r-devel mail list?
Shane
Note that this is an error of print.data.frame. The code will run fine otherwise (with warnings).
Eduardo Leoni
I suppose that I could ask r-devel, but it might get shot down with prejudice by some of the stronger personalities there. From a performance perspective, checking for uniqueness is O(n), so that might be the reason. If someone else thinks that it should go to the developer list, I'll send it.
Ian Fellows
O(n) worst case scenario. O(n) isn't that bad... I would send it to r-devel.
Vince
+6  A: 

Always test what happens when you have an NA!

One thing that I always need to pay careful attention to (after many painful experiences) is NA values. R functions are easy to use, but no amount of programming will overcome issues with your data.

For instance, any vector operation involving an NA evaluates to NA. This is "surprising" on the face of it:

> x <- c(1,1,2,NA)
> 1 + NA
[1] NA
> sum(x)
[1] NA
> mean(x)
[1] NA

This propagates up into other higher-level functions.

In other words, missing values frequently have as much importance as measured values by default. Many functions have na.rm=TRUE/FALSE defaults; it's worth spending some time deciding how to interpret these default settings.
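
For example, with the vector from above, na.rm=TRUE recovers the computation on the non-missing values:

> x <- c(1,1,2,NA)
> sum(x, na.rm=TRUE)
[1] 4
> mean(x, na.rm=TRUE)
[1] 1.333333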

Edit 1: Marek makes a great point. NA values can also cause confusing behavior in logical conditions. For instance:

> TRUE && NA
[1] NA
> FALSE && NA
[1] FALSE
> TRUE || NA
[1] TRUE
> FALSE || NA
[1] NA

This is also true when you're trying to create a conditional expression (for an if statement):

> any(c(TRUE, NA))
[1] TRUE
> any(c(FALSE, NA))
[1] NA
> all(c(TRUE, NA))
[1] NA
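
And if the condition of an if statement comes out as NA, R throws an error rather than silently picking a branch:

> if (any(c(FALSE, NA))) print("hit")
Error in if (any(c(FALSE, NA))) print("hit") : 
  missing value where TRUE/FALSE needed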

When these NA values end up as your vector indexes, many unexpected things can follow. This is all good behavior for R, because it means that you have to be careful with missing values. But it can cause major headaches at the beginning.

Shane
It bites in subscripting too, e.g. `(1:3)[c(TRUE,FALSE,NA)]` gives `1, NA`. It is easy to fall into this trap when you build a logical vector from a vector containing NAs: `(1:3)[c(1,2,NA)<2]`.
Marek
A: 

Coming from compiled languages and Matlab, I've occasionally gotten confused about a fundamental aspect of functions in functional languages: they have to be defined before they're used! It's not enough for them to have been parsed by the R interpreter. This mostly rears its head when you use nested functions.

In Matlab you can do:

function f1()
  v1 = 1;
  v2 = f2();
  fprintf('2 == %d\n', v2);

  function r1 = f2()
    r1 = v1 + 1 % nested function scope
  end
end

If you try to do the same thing in R, you have to put the nested function first, or you get an error! Just because you've written the function definition, it's not in scope until it's assigned to a variable! On the other hand, the function body can refer to a variable that has not been defined yet.

f1 <- function() {
  f2 <- function() {
    v1 + 1
  }

  v1 <- 1

  v2 <- f2()

  print(sprintf("2 == %d", v2))
}
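
To see the failure mode, a sketch with a hypothetical f_bad that calls the nested function before assigning it (assuming no f2 exists elsewhere on the search path):

f_bad <- function() {
  v2 <- f2()   # f2 has not been assigned yet at call time
  f2 <- function() 1
  v2
}
# f_bad() gives: Error in f_bad() : could not find function "f2"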
Harlan
+1  A: 

The number 3.14 is a numeric constant, but the expressions +3.14 and -3.14 are calls to the functions `+` and `-`:

> class(quote(3.14))
[1] "numeric"
> class(quote(+3.14))
[1] "call"
> class(quote(-3.14))
[1] "call"

See Section 13.2 in John Chambers' book Software for Data Analysis: Programming with R.

rcs
+9  A: 

[Hadley pointed this out in a comment.]

When using a sequence as an index for iteration, it's better to use the seq_along() function rather than something like 1:length(x).

Here I create a vector and both approaches return the same thing:

> x <- 1:10
> 1:length(x)
 [1]  1  2  3  4  5  6  7  8  9 10
> seq_along(x)
 [1]  1  2  3  4  5  6  7  8  9 10

Now make the vector NULL:

> x <- NULL
> seq_along(x) # returns an empty integer; good behavior
integer(0)
> 1:length(x) # length(x) is 0, so this counts down and returns c(1, 0); this is bad
[1] 1 0

This can cause some confusion in a loop:

> for(i in 1:length(x)) print(i)
[1] 1
[1] 0
> for(i in seq_along(x)) print(i)
>
Shane
A: 

Mine from today: qnorm() takes probabilities and pnorm() takes quantiles.
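
A quick check at the familiar 95% point:

> qnorm(0.975)      # probability in, quantile out
[1] 1.959964
> pnorm(1.959964)   # quantile in, probability out
[1] 0.975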

Adam SO
Because the names denote what they return, not what they take. Would you prefer that sin() take ratios?
John
Take it easy on me, it was my first submission :)
Adam SO
+1  A: 

First, let me say that I understand the fundamental problems of representing numbers in a binary system. Nevertheless, one problem that I think could easily be improved is the representation of numbers when the decimal value is beyond R's default printing precision.

x <- 10.2 * 100
x
1020
as.integer(x)
1019

I don't mind the result being printed like an integer when it really can be represented as an integer; for example, if the value really were 1020 then printing that for x would be fine. But something as simple as printing 1020.0 in this case would have made it more obvious that the value was not an integer and not representable as one. R should default to some kind of indication when there is an extremely small fractional component that isn't being displayed.
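
You can confirm the hidden fractional component directly:

x <- 10.2 * 100
x == 1020            # FALSE: the stored value is just below 1020
sprintf("%.13f", x)  # "1019.9999999999999"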

John
+1  A: 

It can be annoying to have to allow for combinations of NA, NaN and Inf. They behave differently, and tests for one won't necessarily work for the others:

> x <- c(NA,NaN,Inf)
> is.na(x)
[1]  TRUE  TRUE FALSE
> is.nan(x)
[1] FALSE  TRUE FALSE
> is.infinite(x)
[1] FALSE FALSE  TRUE

However, the safest way to test for any of these trouble-makers is:

> is.finite(x)
[1] FALSE FALSE FALSE
nullglob