I am sure this is a very basic question:

In R I have 600,000 categorical variables, each of which is coded as "0", "1", or "2".

What I would like to do is collapse "1" and "2" and leave "0" by itself, so that after recoding "0" = "0", "1" = "1", and "2" = "1". In the end I only want "0" and "1" as categories for each of the variables.

Also, if possible I would rather not create 600,000 new variables. If I can replace the existing variables with the new values, that would be great!

What would be the best way to do this?

Thank you!

+3  A: 

There is a function recode() in the car package (Companion to Applied Regression):

require("car")    
recode(x, "c('1','2')='1'; else='0'")

or for your case in plain R:

> x <- factor(sample(c("0","1","2"), 10, replace=TRUE))
> x
 [1] 1 1 1 0 1 0 2 0 1 0
Levels: 0 1 2
> factor(pmin(as.numeric(x), 2), labels=c("0","1"))
 [1] 1 1 1 0 1 0 1 0 1 0
Levels: 0 1

Update: To recode all categorical columns of a data frame tmp, you can use the following:

recode_fun <- function(x) factor(pmin(as.numeric(x), 2), labels=c("0","1"))
require("plyr")
catcolwise(recode_fun)(tmp)
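
For anyone without plyr, roughly the same thing can be done in base R, overwriting the columns in place. This is only a sketch: it assumes tmp is a data frame whose categorical columns are factors, and both the sample data and the helper name recode_label are made up for illustration. It recodes by label rather than by integer code, so it does not depend on which levels a given column happens to contain.

```r
# Hypothetical example data; `tmp` stands in for the real data frame.
tmp <- data.frame(
  a = factor(c("0", "1", "2", "0")),
  b = factor(c("2", "2", "0", "1"))
)

# Recode by label: "0" stays "0", anything else becomes "1".
recode_label <- function(x) {
  factor(ifelse(as.character(x) == "0", "0", "1"), levels = c("0", "1"))
}

# Find the factor columns and overwrite them in place,
# so no new variables are created.
is_fac <- vapply(tmp, is.factor, logical(1))
tmp[is_fac] <- lapply(tmp[is_fac], recode_label)
```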
rcs
Thank you for the response! This is how I am applying it to my data specifically. My data is in the form of a data.frame, which I would like to maintain:

data <- read.table("k.csv", header=TRUE, sep = ",")
dta <- data[,1:30]
col = dim(dta)[2]
for (y in 1:col) {
  py <- factor(pmin(as.data.frame(dta[,y]), 2), labels=c("0","1"))
  py
}

Of course that results in an error - I am sure I am not applying it properly
CCA
+3  A: 

recode() is a little overkill for this. The right approach depends on how the variable is currently coded. Let's say your variable is x.

If it's numeric:

x <- ifelse(x>1, 1, x)

If it's character:

x <- ifelse(x=='2', '1', x)

If it's a factor with levels 0, 1, 2:

levels(x) <- c(0,1,1)
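
The factor case can also be written with a named list, which spells out the mapping instead of relying on the order of the existing levels. A small sketch with made-up data:

```r
# Hypothetical factor coded "0"/"1"/"2".
x <- factor(c("1", "0", "2", "2", "0"))

# Each list name is a new level; its value lists the old levels merged into it.
levels(x) <- list("0" = "0", "1" = c("1", "2"))

as.character(x)  # "1" "0" "1" "1" "0"
levels(x)        # "0" "1"
```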

Any of those can be applied in place to a variable x in a data frame dta. For example:

 dta$x <- ifelse(dta$x > 1, 1, dta$x)

Or, for multiple columns of a data frame:

 df[, c('col1','col2')] <- sapply(df[, c('col1','col2')], FUN = function(x) ifelse(x==0, x, 1))
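
If every column is numeric, the whole frame can also be recoded in one step. A rough sketch, assuming dta holds only numeric columns coded 0/1/2 (the sample values here are invented):

```r
# Hypothetical numeric data coded 0/1/2.
dta <- data.frame(v1 = c(0, 1, 2, 2), v2 = c(2, 0, 0, 1))

# dta == 2 is a logical matrix marking every 2; assigning through it
# replaces all of them at once, and dta stays a data frame.
dta[dta == 2] <- 1
```

The same indexing works on a matrix, so with 600,000 columns it may be worth trying as.matrix(dta) first to see whether that is faster.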
John