tags:

views:

428

answers:

1

Note: I changed the example from when I first posted. My first example was too simplified to capture the real problem.

I have two data frames which are sorted differently in one column. I want to match one column and then merge in the value from the second column. The second column needs to stay in the same order.

So I have this:

state<-c("IA","IA","IA","IL","IL","IL")
value1<-c(1,2,3,4,5,6)
s1<-data.frame(state,value1)
state<-c("IL","IL","IL","IA","IA","IA")
value2<-c(3,4,5,6,7,8)
s2<-data.frame(state,value2)

s1
s2

which returns this:

> s1
  state value1
1    IA      1
2    IA      2
3    IA      3
4    IL      4
5    IL      5
6    IL      6
> s2
  state value2
1    IL      3
2    IL      4
3    IL      5
4    IA      6
5    IA      7
6    IA      8

and I want this:

  state value1 value2
1    IA      1      6
2    IA      2      7
3    IA      3      8
4    IL      4      3
5    IL      5      4
6    IL      6      5

I'm about to drive myself silly trying to solve this. Seems like it should be a simple subscript problem.

+3  A: 

There are several ways to do this (it is R, after all) but I think the most clear is creating an index. We need a function that creates a sequential index (starting at one and ending with the number of observations).

seq_len(3) 
> [1] 1 2 3

But we need to calculate this index within each grouping variable (state). For this we can use R's ave function. It takes a numeric as the first argument, then the grouping factors, and finally the function to be applied in each group.

s1$index <- with(s1,ave(value1,state,FUN=seq_len))
s2$index <- with(s2,ave(value2,state,FUN=seq_len))

(Note the use of with, which tells R to search for the variables within the environment/dataframe. This is better practice than using s1$value1, s2$value2, etc.)

Now we can simply merge (join) the two data frames (by the variables present in the both data frames: state and index).

merge(s1,s2)

which gives

   state index value1 value2
1    IA     1      1      6
2    IA     2      2      7
3    IA     3      3      8
4    IL     1      4      3
5    IL     2      5      4
6    IL     3      6      5

For this to work, there should be the same number of observations by state in each of the data frames.

[Edit: commented the code for clarity.] [Edit: Used seq_len instead of creating a new function as suggested by hadley.]

Eduardo Leoni
given an infinite number of monkeys and an infinite number of computers I still could not have come up with that. Thanks you mucho!
JD Long
What is going on in those 3 lines? Can you elaborate a little bit?
Peter Mortensen
You can replace `FUN=function(x) 1:length(x))` with `seq_len`
hadley
Yes, `seq_len` is exactly your getIndex function.
hadley