tags:

views:

99

answers:

3

I have two data frames that both have a column containing a factor like the following:

> head(test.data)
      var0     var1       date  store
1 109.5678 109.5678 1990-03-30 Store1
2 109.3009 108.4261 1990-06-30 Store1
3 108.8262 106.2517 1990-09-30 Store1
4 108.2443 108.6417 1990-12-30 Store1
5 109.5678 109.5678 1991-03-30 Store1
6 109.3009 108.4261 1991-06-30 Store1
> summary(test.data)
      var0             var1              date                store   
 Min.   : -9.72   Min.   : -2.297   Min.   :1990-03-30   Store1 : 8  
 1st Qu.: 68.32   1st Qu.: 71.305   1st Qu.:1990-09-07   Store2 : 8  
 Median :102.19   Median :101.192   Median :1991-02-13   Store3 : 8  
 Mean   :101.09   Mean   :103.042   Mean   :1991-02-13   Store4 : 8  
 3rd Qu.:151.22   3rd Qu.:151.940   3rd Qu.:1991-07-23   Store5 : 8  
 Max.   :196.55   Max.   :201.099   Max.   :1991-12-30   Store6 : 8  
                                                         (Other):48  
>
> head(test.clusters)
   store cluster
1 Store1       A
2 Store2       C
3 Store3       A
4 Store4       B
5 Store5       D
6 Store6       A
>
> summary(test.clusters)
     store   cluster
 Store1 :1   A:5    
 Store2 :1   B:4    
 Store3 :1   C:2    
 Store4 :1   D:1    
 Store5 :1          
 Store6 :1          
 (Other):6          
> 

I want to add a column to test.data containing each row's cluster, based on their shared store. I am currently doing this using a doubly nested loop:

new_col <- rep(test.clusters$cluster[1], length(test.data$store))
for (i in seq(test.data$store)){
  for (j in seq(test.clusters$store)){
    if (test.data$store[i] == test.clusters$store[j]){
      new_col[i] <- test.clusters$cluster[j]
      break
    }       
  }
}
test.data$cluster <- new_col

This is very verbose, grossly inefficient and frankly ugly. What is the idiomatic method for doing this in R?

+6  A: 

Use merge function.

mbq
+1 See also http://stackoverflow.com/questions/3515611/merge-command-in-r and http://stackoverflow.com/questions/2232699/r-how-to-do-a-data-table-merge-operation for example, or do a search on "merge [r]" on the site.
Joris Meys
I'll look into it - didn't know what keyword to search for!
fmark
+2  A: 

I recommend match. It will be much faster then merge.

It should look like this:

test.data$cluster <- test.clusters$cluster[match(test.data$store, test.clusters$store)]
Marek
+5  A: 

Use the join function from plyr. It uses match internally, so should be fast (4-10x faster than merge).

hadley
Note that this only works with the current version of plyr. If you're stuck on an older version of R, for whatever reason, this won't be an option.
Harlan