ansaurus

Question

dropping factor levels in a subsetted data frame in R

Answer 1

+9 A:

It is a known issue, and one possible remedy is provided by drop.levels() in the gdata package where your example becomes

> drop.levels(subdf)
  letters numbers
1       a       1
2       b       2
3       c       3
> levels(drop.levels(subdf)$letters)
[1] "a" "b" "c"

There is also the dropUnusedLevels function in the Hmisc package. However, it only works by altering the subset operator [ and is not applicable here.

As a corollary, a direct approach on a per-column basis is a simple as.factor(as.character(data)):

> levels(subdf$letters)
[1] "a" "b" "c" "d" "e"
> subdf$letters <- as.factor(as.character(subdf$letters))
> levels(subdf$letters)
[1] "a" "b" "c"

Dirk Eddelbuettel 2009-07-28 18:37:13

Answer 2

+2 A:

This is obnoxious. This is how I usually do it, to avoid loading other packages:

levels(subdf$letters)<-c("a","b","c",NA,NA)

which gets you:

> subdf$letters
[1] a b c
Levels: a b c

Note that the new levels will replace whatever occupies their index in the old levels(subdf$letters), so something like:

levels(subdf$letters)<-c(NA,"a","c",NA,"b")

won't work.

This is obviously not ideal when you have lots of levels, but for a few, it's quick and easy.

Matt Parker 2009-07-28 18:44:32

Answer 3

+9 A:

All you should have to do is to apply factor() to your variable again after subsetting:

> subdf$letters
[1] a b c
Levels: a b c d e
subdf$letters <- factor(subdf$letters)
> subdf$letters
[1] a b c
Levels: a b c

From the ?factor page example:

factor(ff)      # drops the levels that do not occur

Stephen 2009-07-28 22:41:31

That's fine for a one-off, but in a data.frame with a large number of columns, you get to do that on every column that is a factor ... leading to the need for a function such as drop.levels() from gdata.

Dirk Eddelbuettel 2009-07-29 14:16:12

I see... but from a user-perspective it's quick to write something like subdf[] <- lapply(subdf,function(x) if(is.factor(x)) factor(x) else x) ...Is drop.levels() much more efficient computationally or better with large data sets? (One would have to rewrite the line above in a for-loop for a huge data frame, I suppose.)

Stephen 2009-07-29 17:09:15

dataspora 2009-07-30 04:18:31

Answer 4

+5 A:

If you don't want this behaviour, don't use factors, use character vectors instead. I think this makes more sense than patching things up afterwards. Try the following before loading your data with read.table or read.csv:

options(stringsAsFactors = FALSE)

The disadvantage is that you're restricted to alphabetical ordering. (reorder is your friend for plots)

hadley 2009-07-28 23:53:43

You can also do read.csv(file='foo.csv', as.is=T).

andrewj 2009-07-29 01:37:27

Answer 5

+5 A:

Here's another way, which I believe is equivalent to the factor(..) approach:

> df <- data.frame(let=letters[1:5], num=1:5)
> subdf <- df[df$num <= 3, ]

> subdf$let <- subdf$let[ , drop=TRUE]

> levels(subdf$let)
[1] "a" "b" "c"

ars 2009-07-29 03:40:37

Answer 6

+1 A:

I wrote utility functions to do this. Now that I know about gdata's drop.levels, it looks pretty similar. Here they are (from here):

present_levels <- function(x) intersect(levels(x), x)

trim_levels <- function(...) UseMethod("trim_levels")

trim_levels.factor <- function(x)  factor(x, levels=present_levels(x))

trim_levels.data.frame <- function(x) {
  for (n in names(x))
    if (is.factor(x[,n]))
      x[,n] = trim_levels(x[,n])
  x
}

Brendan OConnor 2009-09-01 20:37:36

ansaurus

tags:

views:

answers:

dropping factor levels in a subsetted data frame in R

related questions