I guess there will be a very simple answer to this. But here goes.

My data is in long format, like this:

d <- data.frame(numbers = rnorm(10), year = rep(c(2008, 2009), 5), name = c("john", "David", "Tom", "Kristin", "Lisa", "Eve", "David", "Tom", "Kristin", "Lisa"))

How do I get a new data frame containing only the rows for names that occur in both 2008 and 2009 (i.e. only David, Kristin, Lisa and Tom)?

Thanks in advance

+3  A: 

One approach is to use the reshape package to create a data.frame with years in columns and names in rows:

library(reshape)
cast(d, name ~ year, value = "numbers")

You could then use complete.cases to extract the rows of interest.

hadley
Thanks hadley! I was looking for a method that didn't involve casting and melting back and forth. I should have made this explicit. Thanks anyway!
Andreas
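For reference, the "go wide, then filter on complete.cases" idea from this answer can be sketched in base R. Note the assumptions: tapply stands in for cast here so the example runs without the reshape package, the data frame is built without cbind so that numbers stays numeric, and set.seed plus the names wide, both and result are only illustrative.

```r
# Example data shaped like the question's (set.seed is only for reproducibility)
set.seed(1)
d <- data.frame(
  numbers = rnorm(10),
  year = rep(c(2008, 2009), 5),
  name = c("john", "David", "Tom", "Kristin", "Lisa",
           "Eve", "David", "Tom", "Kristin", "Lisa")
)

# Wide layout: one row per name, one column per year.
# Name/year combinations with no data come out as NA.
wide <- with(d, tapply(numbers, list(name, year), mean))

# Names that have a value in every year column
both <- rownames(wide)[complete.cases(wide)]

# Back to the long rows for those names
result <- d[d$name %in% both, ]
```

With the question's data, both should hold David, Kristin, Lisa and Tom, and result keeps their eight rows.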
+2  A: 

If there is only one record per year, just count up the number of times each person appears in the dataset:

counts <- as.data.frame(table(name = d$name))

Then look for everyone who appeared twice:

subset(counts, Freq == 2)
hadley
That was actually the case. But I would still need to subset d with counts$name, or something like that.
Andreas
Yeah, but I figured you could work that out yourself ;)
hadley
Yes - %in% is my new friend :-)
Andreas
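Putting the pieces from this answer and its comments together, a minimal sketch might look like the following. The names twice, pairs, n_years and in_all are made up for illustration, and the second half (counting distinct years per name) is an added variant for when a person can have several records in one year, not something from the answer itself.

```r
set.seed(1)
d <- data.frame(
  numbers = rnorm(10),
  year = rep(c(2008, 2009), 5),
  name = c("john", "David", "Tom", "Kristin", "Lisa",
           "Eve", "David", "Tom", "Kristin", "Lisa")
)

# Count how often each name appears (assumes one record per person per year)
counts <- as.data.frame(table(name = d$name))
twice <- as.character(subset(counts, Freq == 2)$name)

# Subset the original data with %in%
result <- d[d$name %in% twice, ]

# Variant without the one-record-per-year assumption:
# count distinct years per name and keep names seen in every year
pairs <- unique(d[, c("name", "year")])
n_years <- table(pairs$name)
in_all <- names(n_years)[n_years == length(unique(d$year))]
```

On the question's data both routes should pick out the same four names.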
+1  A: 

Here's another solution that uses just base R and doesn't make any assumptions about the number of records a person has per year:

# build the columns directly; wrapping them in cbind() would coerce everything to character
d <- data.frame(numbers = rnorm(10),
                year = rep(c(2008, 2009), 5),
                name = c("john", "David", "Tom", "Kristin",
                         "Lisa", "Eve", "David", "Tom", "Kristin",
                         "Lisa"))
# split data into 2 data.frames (1 for each year)
by.year <- split(d, d$year, drop = TRUE)

# find the names that appear in both years
keep <- intersect(by.year[['2008']]$name, by.year[['2009']]$name)
# Or, if you had several years, use Reduce as a more general solution:
keep <- Reduce(intersect, lapply(by.year, '[[', 'name'))

# show the rows of the original dataset only if their $name field
# is in our 'keep' vector
d[d$name %in% keep,]
Steve Lianoglou
Thanks a lot, Steve. I suspect Reduce will be very useful for me. Didn't know about it.
Andreas
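To see what Reduce buys you over a two-way intersect, here is a small sketch with three years. The data frame d3 and its contents are invented for illustration; only split, lapply, Reduce, intersect and %in% come from the answer.

```r
# Hypothetical data with three years; Eve is missing in 2009
d3 <- data.frame(
  year = c(2008, 2008, 2008, 2009, 2009, 2010, 2010, 2010),
  name = c("David", "Tom", "Eve", "David", "Tom", "David", "Eve", "Tom")
)

# One data frame per year
by.year <- split(d3, d3$year)

# Reduce folds intersect over the per-year name vectors:
# intersect(intersect(names2008, names2009), names2010)
keep <- Reduce(intersect, lapply(by.year, `[[`, "name"))

# Rows for names present in every year
result <- d3[d3$name %in% keep, ]
```

Here keep should contain only David and Tom, since Eve has no 2009 record.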
+11  A: 

Simple way:

subset(
    d,
    name %in% intersect(name[year==2008], name[year==2009])
)
Marek
Brilliant - didn't know about intersect or %in%. Thanks so much!!!
Andreas