I have a data frame with three columns (timestamp, key, and event), ordered by time:
ts,key,event
3,12,1
8,49,1
12,42,1
46,12,-1
100,49,1
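From this, I want to create a data frame that holds, for each timestamp, the share of keys seen so far whose cumulative event sum is non-zero: (number of unique keys seen up to that timestamp - number of those keys whose cumulative sum is 0 at that timestamp) divided by the number of unique keys seen up to that timestamp. E.g. for the above example the result should be: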
ts,prob
3,1
8,1
12,1
46,2/3
100,2/3
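The definition above can be checked against the example with a minimal base-R sketch. It assumes one row per timestamp and swaps plyr for `ave()` so that the running sums stay in time order; the helper names (`n_seen`, `n_zero`, `result`) are only illustrative.

```r
items <- data.frame(ts    = c(3, 8, 12, 46, 100),
                    key   = c(12, 49, 42, 12, 49),
                    event = c(1, 1, 1, -1, 1))

# Running sum of events per key, kept in the original (time) order.
items$sum <- ave(items$event, items$key, FUN = cumsum)

# Number of distinct keys seen so far: count first occurrences.
first  <- !duplicated(items$key)
n_seen <- cumsum(first)

# Track how the number of keys with a running sum of 0 changes per row:
# +1 when a key's sum reaches 0, -1 when a previously-zero key leaves 0.
prev     <- items$sum - items$event   # the key's sum before this row
was_zero <- prev == 0 & !first        # a new key was not counted as zero before
is_zero  <- items$sum == 0
n_zero   <- cumsum(is_zero - was_zero)

result <- data.frame(ts = items$ts, prob = (n_seen - n_zero) / n_seen)
print(result)
```

For the example data this gives 1, 1, 1, 2/3, 2/3 in the prob column, matching the table above.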
My initial step is to calculate the cumsum grouped by key:
library(plyr)  # provides ddply and .()
items = data.frame(ts=c(3,8,12,46,100), key=c(12,49,42,12,49), event=c(1,1,1,-1,1))
sumByKey = ddply(items, .(key), transform, sum=cumsum(event))
In the second (and final) step I iterate over sumByKey with a for-loop and keep track of both all unique keys and all unique keys that have a 0 in their sum using vectors, e.g. if(!(k %in% uniqueKeys)) uniqueKeys = append(uniqueKeys, k). The prob is derived using the two vectors.
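A minimal sketch of that loop, for reference. It is one possible reading of the description, not the exact code: `zeroKeys` and `prob` are illustrative names, and the running sums are computed with base R's `ave()` so the rows stay in ts order (ddply returns rows grouped by key, so sumByKey would first need to be sorted by ts again).

```r
items <- data.frame(ts    = c(3, 8, 12, 46, 100),
                    key   = c(12, 49, 42, 12, 49),
                    event = c(1, 1, 1, -1, 1))
items$sum <- ave(items$event, items$key, FUN = cumsum)  # per-key running sum

uniqueKeys <- numeric(0)  # every key seen so far
zeroKeys   <- numeric(0)  # keys whose running sum is currently 0
prob       <- numeric(nrow(items))

for (i in seq_len(nrow(items))) {
  k <- items$key[i]
  if (!(k %in% uniqueKeys)) uniqueKeys <- append(uniqueKeys, k)
  if (items$sum[i] == 0) {
    zeroKeys <- union(zeroKeys, k)    # key just reached 0
  } else {
    zeroKeys <- setdiff(zeroKeys, k)  # key is (again) non-zero
  }
  prob[i] <- (length(uniqueKeys) - length(zeroKeys)) / length(uniqueKeys)
}

result <- data.frame(ts = items$ts, prob = prob)
```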
Initially, I tried to solve the second step using plyr, but I wanted to avoid re-calculating the unique keys up to a certain timestamp for each row in sumByKey. What I'm missing is a way to either refer to external variables from a function passed to ddply or, alternatively (and more functionally), use an accumulator passed back into the function, e.g. function(acc, x) acc + x.
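To illustrate the accumulator idea, here is a hedged sketch using base R's `Reduce()` with `accumulate = TRUE` instead of ddply. The state is assumed to be a list holding the keys seen so far and the keys whose running sum is 0; `step`, `init`, and `states` are hypothetical names, and the rows are assumed to be in ts order.

```r
items <- data.frame(ts    = c(3, 8, 12, 46, 100),
                    key   = c(12, 49, 42, 12, 49),
                    event = c(1, 1, 1, -1, 1))
items$sum <- ave(items$event, items$key, FUN = cumsum)  # per-key running sum

# function(acc, x): take the previous state and a row index, return the new state.
step <- function(acc, i) {
  k    <- items$key[i]
  seen <- union(acc$seen, k)
  zero <- if (items$sum[i] == 0) union(acc$zero, k) else setdiff(acc$zero, k)
  list(seen = seen, zero = zero)
}

init   <- list(seen = numeric(0), zero = numeric(0))
# accumulate = TRUE keeps every intermediate state; [-1] drops the initial one.
states <- Reduce(step, seq_len(nrow(items)), init, accumulate = TRUE)[-1]

prob <- sapply(states, function(s) (length(s$seen) - length(s$zero)) / length(s$seen))
result <- data.frame(ts = items$ts, prob = prob)
```

This keeps the state explicit rather than in external variables, though each step still performs set operations on the accumulated keys.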
Is it possible to solve the second step in a better way, using e.g. ddply?