views:

321

answers:

6

One of the basic data types in R is factors. In my experience factors are basically a pain in the ass and I never use them. I always convert to characters. I feel oddly like I'm missing something. Are there a lot of functions that use factors as grouping variables? When should I be using factors? Do you use them?

+1  A: 

What a snarky title!

I believe many estimation functions allow you to use factors to easily define dummy variables... but I don't use them for that.

I use them when I have very large character vectors with few unique observations. This can cut down on memory consumption, especially if the strings in the character vector are longer-ish.

PS - I'm joking about the title. I saw your tweet. ;-)

Joshua Ulrich
So you really just use them to conserve storage space. That makes sense.
JD Long
Well, at least it used to ;-). But a few R versions ago character storage was rewritten to be internally hashed, so part of this historic argument is now void. Still, factors are *very* useful for grouping and modeling.
Dirk Eddelbuettel
According to `?factor` it was R-2.6.0 and it says, "Integer values are stored in 4 bytes whereas each reference to a character string needs a pointer of 4 or 8 bytes." Would you save space converting to factor if the character string needed 8 bytes?
Joshua Ulrich
N <- 1000; a <- sample(c("a", "b", "c"), N, replace=TRUE); print(object.size(a), units="Kb"); print(object.size(factor(a)), units="Kb") gives 8 Kb and 4.5 Kb, so it still seems to save some space.
Eduardo Leoni
@Eduardo I got 4 Kb vs 4.2 Kb. For `N=100000` I got 391.5 Kb vs 391.8 Kb. So the factor takes a little more memory.
Marek
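A minimal sketch for rerunning the comparison from the comments above on your own machine. The sizes are deliberately not quoted here, since the numbers reported above differ between machines and presumably depend on the R build and version. The variable names and the longer labels are made up for illustration.

```r
# Compare the memory footprint of a character vector and the equivalent factor.
set.seed(1)
n <- 1e5

# Short labels, as in the comments above.
chr <- sample(c("a", "b", "c"), n, replace = TRUE)
fac <- factor(chr)
print(object.size(chr), units = "Kb")
print(object.size(fac), units = "Kb")

# Longer labels (hypothetical), to check whether string length matters.
long_chr <- sample(c("a fairly long label one",
                     "a fairly long label two",
                     "a fairly long label three"), n, replace = TRUE)
long_fac <- factor(long_chr)
print(object.size(long_chr), units = "Kb")
print(object.size(long_fac), units = "Kb")
```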
+8  A: 

You should use factors. Yes, they can be a pain, but my theory is that 90% of why they're a pain is that in read.table and read.csv the stringsAsFactors argument defaults to TRUE (and most users miss this subtlety). I say they are useful because model fitting packages like lme4 use factors and ordered factors to fit models differently and to determine the type of contrasts to use. Graphing packages also use them for grouping. ggplot and most model fitting functions coerce character vectors to factors, so the result is the same. However, you end up with warnings in your code:

> lm(Petal.Length ~ -1 + Species, data=iris)

Call:
lm(formula = Petal.Length ~ -1 + Species, data = iris)

Coefficients:
    Speciessetosa  Speciesversicolor   Speciesvirginica  
            1.462              4.260              5.552  

> iris.alt <- iris
> iris.alt$Species <- as.character(iris.alt$Species)
> lm(Petal.Length ~ -1 + Species, data=iris.alt)

Call:
lm(formula = Petal.Length ~ -1 + Species, data = iris.alt)

Coefficients:
    Speciessetosa  Speciesversicolor   Speciesvirginica  
            1.462              4.260              5.552  

Warning message:
In model.matrix.default(mt, mf, contrasts) :
  variable 'Species' converted to a factor
> 
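As a small sketch of the stringsAsFactors point made at the top of this answer: passing the argument explicitly to read.csv avoids relying on the default, which differs across R versions (it became FALSE in R 4.0.0). The inline CSV text and the column names are made up for illustration.

```r
# Hypothetical inline data, so the example needs no file on disk.
csv_text <- "id,species\n1,setosa\n2,virginica\n3,setosa"

# Keep strings as plain character vectors.
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(as_chr$species)   # "character"

# Convert strings to factors on read.
as_fac <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(as_fac$species)   # "factor"
levels(as_fac$species)  # "setosa" "virginica"
```

Converting with factor() just before model fitting is another option if you prefer to read everything in as character.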

One tricky thing is the whole drop=TRUE bit. In vectors this works well to remove levels of factors that aren't in the data. For example:

> s <- iris$Species
> s[s == 'setosa', drop=TRUE]
 [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
Levels: setosa
> s[s == 'setosa', drop=FALSE]
 [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica
> 

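However, with data frames, the behavior of [.data.frame() is different: see this email or ?[.data.frame (in backticks, which StackOverflow won't let me escape). Using drop=TRUE on data frames does not work as you'd imagine, because there it controls whether the result is coerced to the lowest possible dimension, not whether unused factor levels are dropped: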

> x <- subset(iris, Species == 'setosa', drop=TRUE)  # subsetting with [ behaves the same way
> x$Species
 [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
[41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica
> 

Luckily you can drop the unused levels easily with:

> x <- subset(iris, Species == 'setosa', drop=TRUE)
> levels(x$Species)
[1] "setosa"     "versicolor" "virginica" 
> x$Species <- factor(x$Species)
> levels(x$Species)
[1] "setosa"

or:

> x$Species <- x$Species[drop=TRUE]
> levels(x$Species)
[1] "setosa"

This is how to keep levels you've filtered out from showing up in ggplot legends.
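As a further sketch, base R also has droplevels() (added in R 2.12.0), which drops unused levels from a single factor or from every factor column of a data frame in one call:

```r
# Subsetting keeps all three original levels on the factor column.
x <- subset(iris, Species == "setosa")
nlevels(x$Species)   # 3

# droplevels() on the data frame removes the unused ones.
x <- droplevels(x)
levels(x$Species)    # "setosa"

# It works on a bare factor as well.
s <- iris$Species
s_setosa <- droplevels(s[s == "setosa"])
levels(s_setosa)     # "setosa"
```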

Internally, factors are integer vectors with a levels attribute that is a character vector (see attributes(iris$Species) and class(attributes(iris$Species)$levels)), which is clean. If you had to change a level name while using character strings, every matching element would have to be rewritten, which is a much less efficient operation than changing a single entry in the levels. And I change level names a lot, especially for ggplot legends. If you fake factors with character vectors, there's the risk that you'll change just one element and accidentally create a separate new level.
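A minimal sketch of that internal structure and of renaming a level; the example factor and its labels are made up for illustration:

```r
f <- factor(c("lo", "hi", "lo", "mid"))

# Integer codes plus a levels attribute (levels are sorted: hi, lo, mid).
as.integer(f)   # 2 1 2 3
levels(f)       # "hi" "lo" "mid"

# Renaming a level touches only the levels attribute; the codes are unchanged
# and every "lo" element becomes "low" at once.
levels(f)[levels(f) == "lo"] <- "low"
as.integer(f)   # 2 1 2 3
as.character(f) # "low" "hi" "low" "mid"
```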

Vince
+3  A: 

Factors are fantastic when one is doing statistical analysis and actually exploring the data. However, prior to that, when one is reading, cleaning, troubleshooting, merging and generally manipulating the data, factors are a total pain. Over the past few years, a lot of functions have improved to handle factors better. For instance, rbind plays nicely with them. I still find it a total nuisance to have leftover empty levels after a subset function.

# drop unused levels from every factor column of a data frame using gdata
library(gdata)
dataframe <- drop.levels(dataframe)

I know that it is straightforward to recode the levels of a factor and to rejig the labels, and there are also wonderful ways to reorder the levels. My brain just cannot remember them, and I have to relearn them every time I use them. Recoding should just be a lot easier than it is.
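For reference, a short sketch of the base R idioms for reordering and recoding levels mentioned above; the example factor and its labels are made up for illustration:

```r
f <- factor(c("small", "large", "medium", "small"))
levels(f)                # alphabetical: "large" "medium" "small"

# Reorder: give the levels in the order you want.
f_ord <- factor(f, levels = c("small", "medium", "large"))
levels(f_ord)            # "small" "medium" "large"

# Move one level to the front (e.g. as the reference level for modeling).
f_ref <- relevel(f, ref = "medium")
levels(f_ref)            # "medium" "large" "small"

# Recode: rename a level in place.
f_new <- f_ord
levels(f_new)[levels(f_new) == "large"] <- "big"
as.character(f_new)      # "small" "big" "medium" "small"
```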

R's string functions are quite easy and logical to use. So when manipulating I generally prefer characters over factors.

Farrel
Do you have examples of stats analysis that use factors?
JD Long
+1  A: 
doug
This isn't a good example, because all those examples would work with strings too.
hadley
+3  A: 

Ordered factors are awesome. If I happen to love oranges and hate apples but don't mind grapes, I don't need to manage some weird index to say so:

d <- data.frame(x = rnorm(20), f = sample(c("apples", "oranges", "grapes"), 20, replace = TRUE, prob = c(0.5, 0.25, 0.25)))
d$f <- ordered(d$f, c("apples", "grapes", "oranges"))
d[d$f >= "grapes", ]
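A smaller deterministic sketch of the same idea, using made-up size labels: comparisons and max() follow the declared level order rather than alphabetical order.

```r
sizes <- ordered(c("medium", "small", "large"),
                 levels = c("small", "medium", "large"))

# Comparison against a level name uses the declared order.
sizes >= "medium"          # TRUE FALSE TRUE

# Summaries such as max() respect the order too.
as.character(max(sizes))   # "large"

# Sorting follows the level order, not the alphabet.
as.character(sort(sizes))  # "small" "medium" "large"
```

With an unordered factor the same comparison is not meaningful: R returns NA with a warning.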
mdsumner
that's a neat application. Never thought of that.
JD Long
+2  A: 

Here's a graphic I made for class to explain the differences:

Character, factor & ordered factor

But 99% of the time, character vectors will serve you just as well as factors.

hadley
I'm getting the feel that the 99% rule is right. Possibly more given my work flow and analysis.
JD Long
@hadley I think the deep linking of the image is not working properly.
JD Long
Maybe this will work better.
hadley