Hi all,

I'd like to count the number of changes of a binary factor variable. This variable can change back and forth multiple times for every user id. I'd like to count the number of changes to this variable per user id over a given timespan.

The data is sorted by id, year, month, myfactor. I tried this in MySQL but had no success so far. Is there an easy way to do it in R? I thought about adding another column to my data.frame and adding up conditions step by step... Maybe some %in% stuff?

Thx in advance for suggestions...

Hmm, of course... here's an example, sorry for not providing it immediately, my head hurts ;):


   myf   Year    month userid   
  1 A    2005       1    260           
  2 B    2005       2    260           
  3 B    2005       4    260           
  4 A    2005       5    260           
  5 B    2005       6    260           
  6 B    2005       1    261 

If this is my dataset, I want to add a changes column that counts the number of changes of myf per user. Basically I'd like to end up with:

  user  changes
   260     3
   261     0

and so forth...

HTH

+2  A: 

Here's my guess.

set.seed(21)
Data <- data.frame(id=sample(letters[1:3],20,TRUE),
                   date=sample(1:3,20,TRUE),
                   myfactor=sample(0:1,20,TRUE))
Data <- Data[order(Data$id,Data$date),]

DataCh <- aggregate(Data[,"myfactor",FALSE],
            by=Data[,c("id","date")], function(x) sum(diff(x)!=0))
DataCh <- DataCh[order(DataCh$id,DataCh$date),]

EDIT: Here's an update with your example data.

lines <- "   myf   Year    month userid   
 1 A    2005       1    260           
 2 B    2005       2    260           
 3 B    2005       4    260           
 4 A    2005       5    260           
 5 B    2005       6    260           
 6 B    2005       1    261 "

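# stringsAsFactors=TRUE keeps myf a factor (no longer the default since R 4.0.0)
Data <- read.table(con <- textConnection(lines), stringsAsFactors=TRUE); close(con)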

DataCh <- aggregate(Data[,"myf",FALSE],
            by=Data[,"userid",FALSE], function(x) sum(diff(unclass(x))!=0))

merge(Data,DataCh,by="userid",suffixes=c("",".change"))
#   userid myf Year month myf.change
# 1    260   A 2005     1          3
# 2    260   B 2005     2          3
# 3    260   B 2005     4          3
# 4    260   A 2005     5          3
# 5    260   B 2005     6          3
# 6    261   B 2005     1          0
Joshua Ulrich
+4  A: 
#Some data
dfr <- data.frame(
   binary_variable = runif(100) < .7,
   id = sample(7, 100, replace = TRUE)
)

#Split by id
split_by_id <- with(dfr, split(binary_variable, id))

#Number of changes
sapply(split_by_id, function(x) sum(diff(x) != 0))
Richie Cotton
This is pretty damn close to what I need. Right now I need to convert myf to a real TRUE/FALSE variable. With Richie Cotton's data everything works fine. With my own I just get: Error in r[i1] - r[-length(r):-(length(r) - lag + 1L)] : non-numeric argument to binary operator ... but I'll probably figure that out.
ran2
To convert your 2-level `factor` to be a `logical` variable, try `myf == "A"`
Richie Cotton
@ran2 For factors use `as.numeric` or `unclass`. E.g. `split_by_id <- split(unclass(factor_variable), id)`.
Marek
Alternatively, `as.integer(myf)` works.
Richie Cotton
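A minimal sketch of the conversions suggested in these comments. The vectors `myf_chr` and `myf_fac` are hypothetical and simply mirror user 260 from the question. It shows that `diff` fails on character data with the error quoted above, while the logical comparison and the factor's integer codes both give the expected count:

```r
# Hypothetical values mirroring user 260 in the question
myf_chr <- c("A", "B", "B", "A", "B")   # character vector
myf_fac <- factor(myf_chr)              # two-level factor

# diff() on character data fails ("non-numeric argument to binary operator")
failed <- inherits(try(diff(myf_chr), silent = TRUE), "try-error")

# Option 1: logical comparison, works for character and factor alike
changes_lgl <- sum(diff(myf_chr == "A") != 0)

# Option 2: integer codes of the factor
# (as.integer() on the raw character vector would give NA instead)
changes_int <- sum(diff(as.integer(myf_fac)) != 0)

print(failed)
print(changes_lgl)
print(changes_int)
```

Note that `as.integer(myf)` only helps when `myf` really is a factor; on a plain character column the comparison `myf == "A"` is the safer choice.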
That worked perfectly. After I solved my troubles with logical values, converting it to a logical variable was the solution. Should I use merge to add the newly obtained information to my dataset? Or can I simply cbind it, because the sorting is not affected?
ran2
@ran2: you need to merge by `userid` (see my answer)
Joshua Ulrich
If that is really all you need, this is easily done with a one-liner: `Data$extra <- ave(as.integer(Data$myf),Data$id,FUN=function(x) sum(diff(x)!=0))`
Joris Meys
Hell, with the help of Richie Cotton I solved it within 2 lines. Now it's even possible in one line. Thx a bunch, unfortunately I do not have enough accepts for y'all! A couple of hours ago I did not have any clue how to solve this and was desperately SQLing around. Yet I do not really get what's being averaged (see help) with this ave function...
ran2
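On what `ave` averages: its `FUN` argument defaults to `mean`, hence the name, but any function can be supplied. The result is written back per group in the original row order. A small sketch, assuming the question's example data retyped as a data frame (with `userid` as the grouping column and a hypothetical helper `count_changes`):

```r
# Example data from the question, retyped by hand
Data <- data.frame(
  myf    = factor(c("A", "B", "B", "A", "B", "B")),
  Year   = 2005,
  month  = c(1, 2, 4, 5, 6, 1),
  userid = c(260, 260, 260, 260, 260, 261)
)

# Hypothetical helper: number of positions where the value differs from the previous one
count_changes <- function(x) sum(diff(x) != 0)

# ave() splits the first argument by userid, applies FUN to each piece,
# and recycles the per-group result over that group's rows
Data$changes <- ave(as.integer(Data$myf), Data$userid, FUN = count_changes)

print(Data)
```

Because the result already has one value per row, it can be assigned as a new column directly, without a separate merge step.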
+5  A: 

Another edit:

Given your responses on the other solutions, you could get what you want in one line:

Data$extra <- ave(as.integer(Data$myf),Data$id,FUN=function(x) sum(diff(x)!=0))

No merge needed in this case.

"Over a given timespan" means that you could select a timespan and then apply the function. Joshua's answer is the fastest way to do it. There's also a more general function, rle, that gives you more information on run lengths and values. Be sure to check that one out.

Based on Joshua's answer, this example shows how you can easily work with dates to select a given timespan.

Edit: I updated the answer to show how to convert your year and month columns into a date. You should also use as.numeric when applying the whole thing to a factor like yours.

#Testdata
set.seed(21)
Data <- data.frame(id=rep(letters[1:3],each=24),
                   year= rep(rep(c(2005,2006),each=12),3),  # 24 months per id, 3 ids
                   month=rep(1:12,6),
                   myf=factor(sample(c("A","B"),24*3,TRUE)))  # factor, so as.numeric gives level codes

#transformation
Data$dates <- as.Date(paste(Data$year,Data$month,"1",sep="-"))
#function

cond.count <- function(from,to,data){
    # strict comparison: rows dated exactly `from` or `to` are excluded
    x <- data[data$dates>from & data$dates<to,]
    tapply(as.numeric(x$myf),x$id,function(y)sum(diff(y)!=0))
}

#example
from <- as.Date("2005-01-01")
to <- as.Date("2006-04-15")

cond.count(from,to,Data)
Joris Meys
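The `rle` function mentioned in this answer can be sketched as follows. The vector `myf` is hypothetical and mirrors user 260 from the question. Each change starts a new run, so the number of changes is the number of runs minus one:

```r
# Hypothetical sequence mirroring user 260 in the question
myf <- c("A", "B", "B", "A", "B")

# rle() works directly on character vectors;
# a factor would need as.character() (or as.integer()) first
runs <- rle(myf)
print(runs$values)     # value of each run
print(runs$lengths)    # length of each run

# Number of changes = number of runs - 1
n_changes <- length(runs$lengths) - 1
print(n_changes)
```

Compared with `sum(diff(x) != 0)`, this also tells you how long each state lasted, which may be useful when the timespan matters.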
The `yearmon` class in `zoo` is handy when you only need monthly granularity.
Joshua Ulrich
@Joshua: very true. I normally use the Date class as I only use other packages reluctantly, given the trouble they can give. But it would be a nicer solution in this case.
Joris Meys
How do I so consistently overlook `ave`? Awesome solution!
Joshua Ulrich
@Joshua: old monkeys and new tricks and all that? :-P
Joris Meys
+1 thanks Joris Meys. Another function I did not know: ave.
ran2