Suppose I have a data.frame with N rows. The id column has 10 unique values; all those values are integers greater than 1e7. I would like to rename them to be numbered 1 through 10 and save these new IDs as a column in my data.frame.

Additionally, I would like to easily determine 1) id given id.new and 2) id.new given id.

For example:

> set.seed(123)
> ids <- sample(1:1e7,10)
> A <- data.frame(id=sample(ids,100,replace=TRUE),
                  x=rnorm(100))
> head(A)
       id          x
1 4566144  1.5164706
2 9404670 -1.5487528
3 5281052  0.5846137
4  455565  0.1238542
5 7883051  0.2159416
6 5514346  0.3796395
A:

Using factors:

> A$id <- as.factor(A$id)
> A$id.new <- as.numeric(A$id)
> head(A)
       id          x id.new
1 4566144  1.5164706      4
2 9404670 -1.5487528     10
3 5281052  0.5846137      5
4  455565  0.1238542      1
5 7883051  0.2159416      7
6 5514346  0.3796395      6

Suppose x is the old ID and you want the new one.

> x <- 7883051
> as.numeric(which(levels(A$id)==x))
[1] 7

Suppose y is the new ID and you want the old one.

> y <- 5
> as.numeric(as.character(A$id[which(as.integer(A$id)==y)[1]]))
[1] 5281052

(The above finds the first value of id at which the internal code for the factor is 5. Are there better ways?)

Christopher DuBois
Old to new doesn't need the `as.numeric`. New to old is just `levels(A$id)[new]`.
hadley
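The simplification suggested in the comment above might look like the following. This is only a minimal sketch that rebuilds the example data; the names `f`, `old`, `new`, and `back` are introduced here for illustration and are not from the original answer.

```r
set.seed(123)
ids <- sample(1:1e7, 10)
A <- data.frame(id = sample(ids, 100, replace = TRUE), x = rnorm(100))

f <- factor(A$id)            # levels are the sorted unique ids, stored as strings
A$id.new <- as.integer(f)    # integer codes 1..10

# old -> new: position of the old id among the levels
old <- A$id[1]
new <- match(as.character(old), levels(f))

# new -> old: index into the levels, then convert the string back to a number
back <- as.numeric(levels(f)[new])
```

Because the levels are stored once, both lookups avoid scanning the rows of `A`.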
A: 

One option is to use the hash package:

> library(hash)
> sn <- sort(unique(A$id))
> g <- hash(1:length(sn),sn)
> h <- hash(sn,1:length(sn))
> A$id.new <- .get(h,A$id)
> head(A)
       id          x id.new
1 4566144  1.5164706      4
2 9404670 -1.5487528     10
3 5281052  0.5846137      5
4  455565  0.1238542      1
5 7883051  0.2159416      7
6 5514346  0.3796395      6

Suppose x is the old ID and you want the new one.

> x <- 7883051
> .get(h,as.character(x))
7883051 
      7

Suppose y is the new ID and you want the old one.

> y <- 5
> .get(g,as.character(y))
      5 
5281052

(This can sometimes be more convenient/transparent than using factors.)

Christopher DuBois
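For comparison, the same two-way lookup can probably be done without an extra package by using named vectors. This is a base-R alternative, not the hash package's API; `to.new` and `to.old` are names made up for this sketch.

```r
set.seed(123)
ids <- sample(1:1e7, 10)
A <- data.frame(id = sample(ids, 100, replace = TRUE), x = rnorm(100))

sn <- sort(unique(A$id))
to.new <- setNames(seq_along(sn), sn)   # names are old ids, values are new ids
to.old <- setNames(sn, seq_along(sn))   # names are new ids, values are old ids

# Look up by name, so keys must be converted to character first
A$id.new <- unname(to.new[as.character(A$id)])

new.of.fifth <- to.new[[as.character(sn[5])]]   # old -> new
old.of.five  <- to.old[["5"]]                   # new -> old
```

As with the hash version, keys are character strings, so numeric ids have to go through `as.character()` before the lookup.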
A:

Try this:

A$id.new <- match(A$id,unique(A$id))

Additional comment: To get the table of values:

rbind(unique(A$id.new),unique(A$id))
Rob Hyndman
ooooh. Hadn't thought of that. That's pretty slick. Is there any way to easily recover the mapping?
Christopher DuBois
Just save `unique(A$id)` - it's equivalent to `levels(factor(A$id))`
hadley
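One way to keep the mapping, following the comment above, is to store the vector of unique ids once and index into it. A minimal sketch, with `key`, `key.sorted`, and `id.sorted` as names introduced here. Note that `unique()` keeps ids in order of first appearance, so the numbering matches the factor-based numbering only if the key is sorted first.

```r
set.seed(123)
ids <- sample(1:1e7, 10)
A <- data.frame(id = sample(ids, 100, replace = TRUE), x = rnorm(100))

key <- unique(A$id)            # lookup table: position = new id, value = old id
A$id.new <- match(A$id, key)   # old -> new for every row

old3 <- key[3]                 # new -> old
new3 <- match(old3, key)       # old -> new, gives 3 again

# To number the ids in increasing order instead, sort the key first
key.sorted <- sort(unique(A$id))
id.sorted <- match(A$id, key.sorted)
```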
A:

You can use factor() / ordered() here:

R> set.seed(123)
R> ids <- sample(1:1e7,10)
R> A <- data.frame(id=sample(ids,100,replace=TRUE), x=rnorm(100))
R> A$id.new <- as.ordered(as.character(A$id))
R> table(A$id.new)

2875776 4089769  455565 4566144 5281052 5514346 7883051 8830172 8924185 9404670 
      6      10       6       8      12      10      13      10      10      15

And you can then use as.numeric() to map to 1 to 10:

R> A$id.new <- as.numeric(A$id.new)
R> summary(A)
       id                x               id.new     
 Min.   : 455565   Min.   :-2.3092   Min.   : 1.00  
 1st Qu.:4566144   1st Qu.:-0.6933   1st Qu.: 4.00  
 Median :5514346   Median :-0.0634   Median : 6.00  
 Mean   :6370243   Mean   :-0.0594   Mean   : 6.07  
 3rd Qu.:8853675   3rd Qu.: 0.5575   3rd Qu.: 8.25  
 Max.   :9404670   Max.   : 2.1873   Max.   :10.00  
R>
Dirk Eddelbuettel
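One thing to watch with this approach: converting to character first makes the levels sort as strings, which is why 455565 appears third in the table above rather than first. A small sketch of the difference, using a made-up vector `v` and base R only:

```r
v <- c(4566144, 455565, 2875776)

lex <- as.ordered(as.character(v))  # levels sorted as strings
num <- factor(v)                    # levels sorted as numbers

levels(lex)       # "2875776" "455565"  "4566144"
levels(num)       # "455565"  "2875776" "4566144"
as.integer(lex)   # 3 2 1
as.integer(num)   # 3 1 2
```

If the new ids should follow numeric order, dropping the `as.character()` step is probably the simplest fix.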