Dear overflowers,

I have a rather large dataset in long format in which the same ID (person) can appear in multiple rows, because of either of two variables, A and B. Counting the number of rows per ID is not too hard, but I also need to count the number of rows per ID that are due to A and the number due to B, and return these counts as variables in the dataset.

Regards,

//Mi

A: 

Here is one approach using 'table' to count rows meeting your criteria, and 'merge' to add the frequencies back to the data frame.

> df<-data.frame(ID=rep(c(1,2),4),GRP=rep(c("a","a","b","b"),2))
> id.frq <- as.data.frame(table(df$ID))
> colnames(id.frq) <- c('ID','ID.FREQ')
> df <- merge(df,id.frq)
> grp.frq <- as.data.frame(table(df$ID,df$GRP))
> colnames(grp.frq) <- c('ID','GRP','GRP.FREQ')
> df <- merge(df,grp.frq)
> df
  ID GRP ID.FREQ GRP.FREQ
1  1   a       4        2
2  1   a       4        2
3  1   b       4        2
4  1   b       4        2
5  2   a       4        2
6  2   a       4        2
7  2   b       4        2
8  2   b       4        2
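
If the row reordering and the factor conversion of ID that come with as.data.frame(table(...)) plus merge() are unwanted, the same counts can probably be looked up directly from the tables. This is only a sketch of a variant, not part of the approach above; the names id.tab and grp.tab are made up for illustration, and the toy data is the same as above.

```r
# Same toy data as in the answer above
df <- data.frame(ID = rep(c(1, 2), 4),
                 GRP = rep(c("a", "a", "b", "b"), 2))

# Rows per ID: index the 1-d table by each row's ID, used as a character key
id.tab <- table(df$ID)                      # id.tab is a hypothetical name
df$ID.FREQ <- as.vector(id.tab[as.character(df$ID)])

# Rows per ID/GRP combination: index the 2-d table with a two-column
# character matrix of (ID, GRP) keys, one row of keys per row of df
grp.tab <- table(df$ID, df$GRP)             # grp.tab is a hypothetical name
df$GRP.FREQ <- grp.tab[cbind(as.character(df$ID), as.character(df$GRP))]
```

Because nothing is merged, the rows of df stay in their original order and ID keeps its original type.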
wkmor1
Not exactly sure what was meant by 'return these as variables in the dataset', but I interpreted it as best I could.
wkmor1
... yeah, I didn't have a clue about that until I saw yours.
Matt Parker
+3  A: 

The ddply() function from the package plyr lets you break data apart by identifier variables, perform a function on each chunk, and then assemble it all back together. So you need to break your data apart by identifier and A/B status, count how many times each of those combinations occurs (using nrow()), and then put those counts back together nicely.

Using wkmor1's df:

library(plyr)

x <- ddply(.data = df, .variables = c("ID", "GRP"), .fun = nrow)

which returns:

  ID GRP V1
1  1   a  2
2  1   b  2
3  2   a  2
4  2   b  2

And then merge that back on to the original data:

merge(x, df, by = c("ID", "GRP"))
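
For anyone without plyr installed, the split / apply / combine steps described above can be sketched in base R. This is an illustration of the idea, not how ddply() works internally; chunks, counts and merged are made-up names, and the row order of x may differ from what ddply() returns.

```r
# Same toy data as wkmor1's df
df <- data.frame(ID = rep(c(1, 2), 4),
                 GRP = rep(c("a", "a", "b", "b"), 2))

# Split: one chunk of rows per ID/GRP combination that actually occurs
chunks <- split(df, list(df$ID, df$GRP), drop = TRUE)

# Apply: summarise each chunk as a one-row data frame holding its row count
counts <- lapply(chunks, function(chunk) {
  data.frame(ID = chunk$ID[1], GRP = chunk$GRP[1], V1 = nrow(chunk))
})

# Combine: stack the one-row summaries into a single data frame
x <- do.call(rbind, counts)
rownames(x) <- NULL

# Attach the counts to every original row
merged <- merge(df, x, by = c("ID", "GRP"))
```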
Matt Parker
... if you're going to downvote, could you at least mention why?
Matt Parker
+2  A: 

OK, given the interpretations I see, then the fastest and easiest solution is...

df$IDCount <- ave(df$ID, df$group, FUN = length)
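
A minimal sketch of the same ave() idea, assuming the column names from wkmor1's example df (ID and GRP) rather than a column called group. The helper vector rows and the column name GRPCount are made up here; passing both ID and GRP as grouping variables gives the count per ID/GRP combination.

```r
# Same toy data as wkmor1's df
df <- data.frame(ID = rep(c(1, 2), 4),
                 GRP = rep(c("a", "a", "b", "b"), 2))

# ave() applies FUN within each group and returns a vector as long as its
# input, so the result can be assigned straight back as a new column
rows <- seq_along(df$ID)                    # hypothetical helper vector
df$IDCount  <- ave(rows, df$ID, FUN = length)           # rows per ID
df$GRPCount <- ave(rows, df$ID, df$GRP, FUN = length)   # rows per ID and GRP
```

Using a plain integer vector as the first argument sidesteps any surprises if ID happens to be stored as a factor or character column.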
John