tags:
views: 81
answers: 2

I have a data frame with three columns: timestamp, key, event which is ordered by time.

ts,key,event
3,12,1
8,49,1
12,42,1
46,12,-1
100,49,1

From this, I want to create a data frame with the timestamp and, for each timestamp, (the number of unique keys minus the number of unique keys whose cumulative sum is 0 up until that timestamp) divided by the number of unique keys seen up to that same timestamp. E.g. for the above example the result should be:

ts,prob
3,1
8,1
12,1
46,2/3
100,2/3

My initial step is to calculate the cumsum grouped by key:

library(plyr)

items = data.frame(ts=c(3,8,12,46,100), key=c(12,49,42,12,49), event=c(1,1,1,-1,1))
sumByKey = ddply(items, .(key), transform, sum=cumsum(event))
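For reference, sumByKey then holds the per-key running sums, with rows regrouped by key rather than by timestamp; for the example data it should look roughly like this:

   ts key event sum
1   3  12     1   1
2  46  12    -1   0
3  12  42     1   1
4   8  49     1   1
5 100  49     1   2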

In the second (and final) step I iterate over sumByKey with a for-loop and keep track of both all unique keys and all unique keys that have a 0 in their sum, using vectors, e.g. if (!(k %in% uniqueKeys)) uniqueKeys <- append(uniqueKeys, k). The prob is then derived from the two vectors.
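A minimal sketch of that loop, assuming sumByKey from above and re-ordering it by timestamp first (variable names are illustrative, not my actual code):

sumByKey <- sumByKey[order(sumByKey$ts), ]

uniqueKeys <- c()  # keys seen so far
zeroKeys   <- c()  # keys whose running sum has reached 0
prob       <- numeric(nrow(sumByKey))

for (i in seq_len(nrow(sumByKey))) {
  k <- sumByKey$key[i]
  if (!(k %in% uniqueKeys)) uniqueKeys <- append(uniqueKeys, k)
  if (sumByKey$sum[i] == 0 && !(k %in% zeroKeys)) zeroKeys <- append(zeroKeys, k)
  prob[i] <- (length(uniqueKeys) - length(zeroKeys)) / length(uniqueKeys)
}

result <- data.frame(ts = sumByKey$ts, prob = prob)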

Initially, I tried to solve the second step using plyr, but I wanted to avoid re-calculating the unique keys up to a certain timestamp for each row in sumByKey. What I'm missing is a way to either refer to external variables from a function passed to ddply, or, alternatively (and more functionally), to use an accumulator passed back into the function, e.g. function(acc, x) acc + x.
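In base R, that accumulator style can at least be imitated with Reduce(..., accumulate = TRUE); e.g. a running sum over event, just to illustrate the pattern rather than the full solution:

Reduce(function(acc, x) acc + x, items$event, accumulate = TRUE)
# same result as cumsum(items$event)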

Is it possible to solve the second step in a better way, using e.g. ddply?

A: 

If your problem is only computational time, I bet the better idea would be to implement your algorithm as a C chunk; you can first use R to convert the keys to a contiguous range of integers (as.numeric(factor(...))) and then use a boolean array in C to track unique keys easily and very fast. Remember that neither plyr nor the standard R *apply functions are significantly faster than loops (provided both are used without embarrassing errors, of course).
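To illustrate the bookkeeping such a C chunk would do (the C code itself is omitted), here is the same idea written as a plain R loop over integer-coded keys and boolean arrays; treat it as a sketch, not the answer's actual implementation:

ikey <- as.numeric(factor(items$key))  # map keys to 1..n
n    <- max(ikey)

seen   <- logical(n)   # key already encountered?
zeroed <- logical(n)   # key's running sum has reached 0?
runsum <- numeric(n)   # running sum per key
prob   <- numeric(nrow(items))

for (i in seq_along(ikey)) {
  k <- ikey[i]
  seen[k]   <- TRUE
  runsum[k] <- runsum[k] + items$event[i]
  if (runsum[k] == 0) zeroed[k] <- TRUE
  prob[i] <- (sum(seen) - sum(zeroed)) / sum(seen)
}

prob  # 1 1 1 0.667 0.667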

mbq
I think it is what I've written, or I just don't understand your comment.
mbq
+2  A: 

If my interpretation is right, then this should do it:

items = data.frame(ts=c(3,8,12,46,100), key=c(12,49,42,12,49), event=c(1,1,1,-1,1))

# number of keys whose running sum has reached zero; no ddply necessary
nzero <- cumsum(ave(items$event, items$key, FUN=cumsum) == 0)

# number of unique keys seen up to each timepoint
nunique <- rep(FALSE, length(items$key))
nunique[match(unique(items$key), items$key)] <- TRUE
nunique <- cumsum(nunique)

# prob = (unique keys - keys that have reached zero) / unique keys
items$p <- (nunique - nzero) / nunique

items
   ts key event         p
1   3  12     1 1.0000000
2   8  49     1 1.0000000
3  12  42     1 1.0000000
4  46  12    -1 0.6666667
5 100  49     1 0.6666667
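For the example data, the intermediate vectors work out as follows (values shown as comments):

ave(items$event, items$key, FUN=cumsum)  # 1 1 1 0 2  per-key running sums
nzero                                    # 0 0 0 1 1  keys that have reached 0 so far
nunique                                  # 1 2 3 3 3  distinct keys seen so far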
Joris Meys
I like this solution, very efficient and elegant, thanks!
mkhq