tags:

views:

99

answers:

3

I need to take a data.frame in the format of:

  id1 id2 mean start end
1   A   D    4    12  15
2   B   E    5    14  15
3   C   F    6     8  10

and generate duplicate rows based on the difference in start - end. For example, I need 3 rows for the first row, 1 for the second, and 2 for the third. The start and end fields should be in sequential order in the final data.frame. The end result for this data.frame should be:

   id1 id2 mean start end
1    A   D    4    12  13
2    A   D    4    13  14
3    A   D    4    14  15
21   B   E    5    14  15
31   C   F    6     8   9
32   C   F    6     9  10

I have written this function which works, but isn't written in very R'esque code:

dupData <- function(df){
    diff <- abs(df$start - df$end)
    ret <- {}

    #Expand our dataframe into the appropriate number of rows.
    for (i in 1:nrow(df)){
        for (j in 1:diff[i]){
            ret <- rbind(ret, df[i,])
        } 
    }

    #If matching ID1 and ID2, generate a sequential ordering of start & end dates
    for (k in 2:nrow(ret) - 1) {
        if ( ret[k,1] == ret[k + 1, 1] & ret[k, 2] == ret[k, 2]  ){ 
            ret[k, 5] <- ret[k, 4] + 1
            ret[k + 1, 4] <- ret[k, 5]  
        }
    }
    return(ret)
}

Does anyone have suggestions on how to optimize this code? Is there a function in plyr which may be applicable?

#sample daters
df <- data.frame(id1 = c("A", "B", "C")
        , id2 = c("D", "E", "F")
        , mean = c(4,5,6)  
        , start = c(12,14,8)
        , end = c(15, 15, 10)
)
A: 

The survSplit function of the survival package does something along these lines, though it has a bit more options (eg specifying the cut times). You might be able to use it, or look at its code to see if you can implement your simplified version better.

Aniko
+2  A: 

There's probably a more general way to do this, but below uses rbind.fill.

cbind(df[rep(1:nrow(df), times = apply(df[,4:5], 1, diff)), 1:3],
      rbind.fill(apply(df[,4:5], 1, function(x)
                       data.frame(start = x[1]:(x[2]-1), end = (x[1]+1):x[2]))))


##     id1 id2 mean start end
## 1     A   D    4    12  13
## 1.1   A   D    4    13  14
## 1.2   A   D    4    14  15
## 2     B   E    5    14  15
## 3     C   F    6     8   9
## 3.1   C   F    6     9  10
apeescape
That is some pretty fancy work there, I appreciate it. It took ~1.5 minutes working with a 100k row data frame to output the data in the appropriate format. Thanks!
Chase