Given two data frames

df1 = data.frame(CustomerId=c(1:6),Product=c(rep("Toaster",3),rep("Radio",3)))
df2 = data.frame(CustomerId=c(2,4,6),State=c(rep("Alabama",2),rep("Ohio",1)))

> df1
  CustomerId Product
           1 Toaster
           2 Toaster
           3 Toaster
           4   Radio
           5   Radio
           6   Radio

> df2
  CustomerId   State
           2 Alabama
           4 Alabama
           6    Ohio

How can I do database-style (i.e., SQL-style) joins? That is, how do I get:

  • An inner join of df1 and df2
  • An outer join of df1 and df2
  • A left outer join of df1 and df2
  • A right outer join of df1 and df2

P.S. IKT-JARQ (I Know This - Just Adding R Questions)

Extra credit:

How can I do a SQL-style select statement?

+3  A: 

There are some good examples of doing this over at the R Wiki. I'll steal a couple here:

Merge Method

Since your keys are named the same, the short way to do an inner join is merge():

merge(df1,df2)

a full outer join (all records from both tables) can be created with the "all" argument:

merge(df1,df2, all=TRUE)

a left outer join of df1 and df2:

merge(df1,df2, all.x=TRUE)

a right outer join of df1 and df2:

merge(df1,df2, all.y=TRUE)

you can flip 'em, slap 'em and rub 'em down to get the other two outer joins you asked about :)

Subscript Method

A left outer join with df1 on the left using a subscript method would be:

df1[, "State"] <- df2[match(df1[, "CustomerId"], df2[, "CustomerId"]), "State"]

The other combinations of outer joins can be created by munging the left outer join subscript example. (yeah, I know that's the equivalent of saying "I'll leave it as an exercise for the reader...")
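
For what it's worth, here is a minimal sketch of the subscript idea using match() on the key column. It assumes CustomerId is unique in df2 (match() only returns the first hit), and the names idx, left and inner are just for illustration:

```r
df1 <- data.frame(CustomerId = 1:6,
                  Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 <- data.frame(CustomerId = c(2, 4, 6),
                  State = c(rep("Alabama", 2), "Ohio"))

# For each df1 key, the row position of that key in df2 (NA if absent).
# Assumption: keys in df2 are unique, since match() returns only the first hit.
idx <- match(df1$CustomerId, df2$CustomerId)

# Left outer join: keep every df1 row; State is NA where no match was found
left <- df1
left$State <- df2$State[idx]

# Inner join: keep only the df1 rows whose key was found in df2
inner <- left[!is.na(idx), ]
```

With duplicate keys on the right-hand side this no longer behaves like a real join, so merge() is probably the safer default.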

JD Long
Huh, I didn't know that R would take "x" or "df1" for the "all.__" parameter.
Matt Parker
yeah, it's kinda nice when you read code done with the all.variable notation. It keeps you from having to think about which is x and which is y. Also easier for those uninitiated with merge() to figure out.
JD Long
It's `all.x` and `all.y`, not `all.df1` and `all.df2`.
hadley
I thought so, too, but I'm (usually) willing to give JD the benefit of the doubt. It actually works.
Matt Parker
I tested what Hadley is saying and (in my environment at least), all.df1 is not the same as all.x. Looks like all.df1 is ignored completely.
Dan Goldstein
Ah, I just tested all.y against all.df2, which are the same (even if R ignores the all.df2 parameter). Doesn't fly for df1 and x. That was lazy of me. I always forget that R just ignores parameters it doesn't understand - I sometimes wish it would throw a warning, at least.
Matt Parker
shit.. you guys are exactly right. I thought I tested it. I must have not had coffee yet. I will fix my answer!
JD Long
+10  A: 

By using the merge function and its optional parameters:

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by="CustomerId") to make sure that you were matching on only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.

Outer join: merge(df1, df2, all=TRUE)

Left outer: merge(df1, df2, all.x=TRUE)

Right outer: merge(df1, df2, all.y=TRUE)

Just as with the inner join, you would probably want to explicitly pass "CustomerId" to R as the matching variable.
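
A short sketch of the explicit-key forms, in case it helps. The frame df2_renamed and its column CustId are made up here purely to illustrate by.x and by.y; they are not part of the question:

```r
df1 <- data.frame(CustomerId = 1:6,
                  Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 <- data.frame(CustomerId = c(2, 4, 6),
                  State = c(rep("Alabama", 2), "Ohio"))

# Explicit key: same result as merge(df1, df2) for these frames,
# but it will not silently match on any other shared column names
inner <- merge(df1, df2, by = "CustomerId")

# Hypothetical variant of df2 whose key column is named CustId instead
df2_renamed <- data.frame(CustId = c(2, 4, 6),
                          State = c(rep("Alabama", 2), "Ohio"))

# by.x names the key in the first frame, by.y the key in the second;
# the merged result keeps the by.x name (CustomerId)
left <- merge(df1, df2_renamed,
              by.x = "CustomerId", by.y = "CustId", all.x = TRUE)
```

The same by / by.x / by.y arguments combine with all=TRUE and all.y=TRUE for the other outer joins.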

Matt Parker
Matt, good answer! Do you mind fixing your missing parenthesis?
JD Long
Man, sloppy as hell today, apparently.
Matt Parker
Andhowaboutsomespacesinyourcode? ;)
hadley
I blame you for that, Hadley. I learned R at ISU, and this is how I turned out - clearly, you were not evangelizing sufficiently. :P
Matt Parker
That was before I became such an evangelist for spaces!
hadley
+6  A: 

Hi Dan - I would recommend checking out Gabor Grothendieck's sqldf package, which allows you to express these operations in SQL.

library(sqldf)

## inner join
df3 <- sqldf("SELECT CustomerId, Product, State 
       FROM df1 JOIN df2 USING(CustomerId)")

## left join (substitute 'right' for right join)
df4 <- sqldf("SELECT CustomerId, Product, State 
       FROM df1 LEFT JOIN df2 USING(CustomerId)")

I find the SQL syntax to be simpler and more natural than its R equivalent (but this may just reflect my RDBMS bias).
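
For comparison, and as a rough take on the "extra credit" part of the question, a simple SELECT ... WHERE can be approximated in base R with subset() or plain subscripting. This is only a sketch; the filter on Product == "Radio" and the names radios and radios2 are illustrative choices, not something from the question:

```r
df1 <- data.frame(CustomerId = 1:6,
                  Product = c(rep("Toaster", 3), rep("Radio", 3)))

# Roughly: SELECT CustomerId FROM df1 WHERE Product = 'Radio'
radios <- subset(df1, Product == "Radio", select = CustomerId)

# The same thing with subscripting; drop = FALSE keeps the result a data frame
radios2 <- df1[df1$Product == "Radio", "CustomerId", drop = FALSE]
```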

See Gabor's sqldf Google Code page for more information.

dataspora