views:

509

answers:

3

Hi, I have a big performance problem in R. I wrote a function that iterates over a data.frame object. It simply adds a new column to the data.frame and accumulates something (a simple operation). The data.frame has roughly 850,000 rows. My PC has been working for about 10 hours now and I have no idea about the runtime.

dayloop2 <- function(temp){
    for (i in 1:nrow(temp)){    
        temp[i,10] <- i
        if (i > 1) {             
            # same values in columns 6 and 3 as the previous row: accumulate
            if ((temp[i,6] == temp[i-1,6]) & (temp[i,3] == temp[i-1,3])) { 
                temp[i,10] <- temp[i,9] + temp[i-1,10]                    
            } else {
                # otherwise restart the running sum from column 9
                temp[i,10] <- temp[i,9]                                    
            }
        } else {
            temp[i,10] <- temp[i,9]
        }
    }
    names(temp)[names(temp) == "V10"] <- "Kumm."
    return(temp)
}

Any ideas how to speed up this operation?

+4  A: 

This can be made much faster by skipping the loop entirely, using indexes or nested ifelse() statements.

idx <- 1:nrow(temp)
temp[,10] <- idx
idx1 <- c(FALSE, (temp[-nrow(temp),6] == temp[-1,6]) & (temp[-nrow(temp),3] == temp[-1,3]))
temp[idx1,10] <- temp[idx1,9] + temp[which(idx1)-1,10] 
temp[!idx1,10] <- temp[!idx1,9]    
temp[1,10] <- temp[1,9]
names(temp)[names(temp) == "V10"] <- "Kumm."
Shane
I just discovered ifelse last night; I'm amazed at how much it has sped up some of my code.
Stedy
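
For instance, a rough sketch with made-up data, comparing a per-element if/else loop against a single vectorized ifelse() call:

x <- rnorm(1e6)
out <- numeric(length(x))
system.time(for (i in seq_along(x)) out[i] <- if (x[i] > 0) x[i] else 0)   # explicit loop
system.time(out2 <- ifelse(x > 0, x, 0))                                   # vectorized ifelse()
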
Thanks for the answer. I'm trying to understand your statements. Line 4, `temp[idx1,10] <- temp[idx1,9] + temp[which(idx1)-1,10]`, caused an error because the length of the longer object is not a multiple of the length of the shorter object: `temp[idx1,9]` is `num [1:11496]` and `temp[which(idx1)-1,10]` is `int [1:11494]`, so 2 rows are missing.
Kay
If you provide a data sample (use dput() with a few rows) then I'll fix it for you. Because of the which()-1 bit, the indexes are unequal. But you should see how it works from here: there's no need for any looping or applying; just use vectorized functions.
Shane
Wow! I've just changed a nested if..else block plus mapply to a nested ifelse() call and got a 200x speedup!
James
@Shane Your general advice is right, but the code misses the fact that the `i`-th value depends on the `i-1`-th, so they can't be set the way you do it (using `which()-1`).
Marek
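
A tiny sketch of the problem with made-up numbers: the single vectorized assignment reads the old neighbouring values, while a running sum needs the already-updated ones.

x <- c(1, 1, 1, 1)
res <- x
cond <- c(FALSE, TRUE, TRUE, TRUE)
res[cond] <- x[cond] + res[which(cond) - 1]   # gives 1 2 2 2 (old neighbours)
cumsum(x)                                     # the intended running sum: 1 2 3 4
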
A: 

In R, you can often speed up loop processing by using the apply family of functions (in your case, it would probably be replicate). Have a look at the plyr package, which provides progress bars.
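
For example, a rough sketch of a plyr call with a text progress bar (the per-row function here is only a placeholder):

library(plyr)
res <- llply(seq_len(nrow(temp)), function(i) temp[i, 9] * 2,   # placeholder work
             .progress = "text")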

Another option is to avoid loops altogether and replace them with vectorized arithmetic. I'm not sure exactly what you are doing, but you can probably apply your function to all rows at once:

temp[1:nrow(temp), 10] <- temp[1:nrow(temp), 9] + temp[0:(nrow(temp)-1), 10]

This will be much much faster, and then you can filter the rows with your condition:

cond.i <- (temp[i, 6] == temp[i-1, 6]) & (temp[i, 3] == temp[i-1, 3])
temp[cond.i, 10] <- temp[cond.i, 9]

Vectorized arithmetic requires more time and thinking about the problem, but you can sometimes save several orders of magnitude in execution time.

Calimo
You're spot on that vectorized functions will be faster than loops or apply(), but it's not true that apply() is faster than loops. In many cases apply() simply abstracts the loop away from the user but still loops. See this previous question: http://stackoverflow.com/questions/2275896/is-rs-apply-family-more-than-syntactic-sugar
JD Long
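
A quick sketch to check this on your own machine: the apply-family call is in the same ballpark as the explicit loop, while the truly vectorized call is much faster.

x <- rnorm(1e5)
f <- function(v) v^2 + 1
system.time(for (i in seq_along(x)) f(x[i]))   # explicit loop
system.time(sapply(x, f))                      # apply family: still a loop underneath
system.time(f(x))                              # vectorized: much faster
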
+4  A: 

The biggest problem and root of ineffectiveness is indexing data.frame, I mean all these lines where you use temp[,].
Try to avoid this as much as possible. I took your function, changed the indexing, and here is version_A:

dayloop2_A <- function(temp){
    res <- numeric(nrow(temp))
    for (i in 1:nrow(temp)){    
        res[i] <- i
        if (i > 1) {             
            if ((temp[i,6] == temp[i-1,6]) & (temp[i,3] == temp[i-1,3])) { 
                res[i] <- temp[i,9] + res[i-1]                   
            } else {
                res[i] <- temp[i,9]                                    
            }
        } else {
            res[i] <- temp[i,9]
        }
    }
    temp$`Kumm.` <- res
    return(temp)
}

As you can see, I create a vector res which gathers the results. At the end I add it to the data.frame and I don't need to mess with names. So how much better is it?

I run each function for a data.frame with nrow from 1,000 to 10,000 by 1,000 and measure the time with system.time:

n <- 1000   # one of the tested sizes (1,000 up to 10,000)
X <- as.data.frame(matrix(sample(1:10, n*9, TRUE), n, 9))
system.time(dayloop2(X))
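
The whole sweep can be scripted along these lines (a sketch of the setup described above):

ns <- seq(1000, 10000, by = 1000)
times <- sapply(ns, function(n) {
    X <- as.data.frame(matrix(sample(1:10, n*9, TRUE), n, 9))
    c(orig = system.time(dayloop2(X))["elapsed"],
      A    = system.time(dayloop2_A(X))["elapsed"])
})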

The result is:

[plot: performance]

You can see that the time of your version grows exponentially with nrow(X). The modified version scales linearly, and a simple lm model predicts that for 850,000 rows the computation takes 6 minutes and 10 seconds.
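
The extrapolation itself is short once the timings are collected; a sketch, assuming the ns and times objects from the benchmark snippet above:

t_A <- times["A.elapsed", ]                    # elapsed times of dayloop2_A
fit <- lm(t_A ~ ns)
predict(fit, data.frame(ns = 850000)) / 60     # predicted minutes for 850,000 rows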

Power of vectorization

As Shane and Calimo state in their answers, vectorization is a key to better performance. From your code you could move outside of the loop:

  • conditioning
  • initialization of the results (which are temp[i,9])

This leads to the following code:

dayloop2_B <- function(temp){
    cond <- c(FALSE, (temp[-nrow(temp),6] == temp[-1,6]) & (temp[-nrow(temp),3] == temp[-1,3]))
    res <- temp[,9]
    for (i in 1:nrow(temp)) {
        if (cond[i]) res[i] <- temp[i,9] + res[i-1]
    }
    temp$`Kumm.` <- res
    return(temp)
}

Compare the results for these functions, this time for nrow from 10,000 to 100,000 by 10,000.

[plot: performance]

Tuning the tuned

Another tweak is to change the indexing in the loop from temp[i,9] to res[i] (which are exactly the same in the i-th loop iteration). It's again the difference between indexing a vector and indexing a data.frame.
Second thing: when you look at the loop you can see that there is no need to loop over all i, but only over the ones that fit the condition.
So here we go

dayloop2_D <- function(temp){
    cond <- c(FALSE, (temp[-nrow(temp),6] == temp[-1,6]) & (temp[-nrow(temp),3] == temp[-1,3]))
    res <- temp[,9]
    for (i in (1:nrow(temp))[cond]) {
        res[i] <- res[i] + res[i-1]
    }
    temp$`Kumm.` <- res
    return(temp)
}

The performance you gain depends heavily on the structure of your data, precisely on the percentage of TRUE values in the condition. For my simulated data the computation time for 850,000 rows is below one second.

[plot: performance]
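
One way to simulate data with a controlled share of TRUE conditions, as a sketch (the group id drives columns 3 and 6, so roughly p of consecutive rows satisfy the condition):

n <- 850000
p <- 0.9                                   # target share of TRUE conditions
grp <- cumsum(runif(n) > p)                # long runs of identical group ids
X <- as.data.frame(matrix(sample(1:10, n*9, TRUE), n, 9))
X[, 3] <- grp
X[, 6] <- grp
system.time(dayloop2_D(X))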

If you want, you can go further; I see at least two things which can be done:

  • write C code to do a conditional cumsum
  • if you know that the maximum run length in your data isn't large, then you can change the loop to a vectorized while, something like

    n <- length(cond)
    while (any(cond)) {
        indx <- c(FALSE, cond[-1] & !cond[-n])   # first still-pending element of each run
        res[indx] <- res[indx] + res[which(indx)-1]
        cond[indx] <- FALSE
    }
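
    Wrapped into a complete function (my own naming for the sketch), it should give the same result as dayloop2_D:

    dayloop2_E <- function(temp){
        cond <- c(FALSE, (temp[-nrow(temp),6] == temp[-1,6]) & (temp[-nrow(temp),3] == temp[-1,3]))
        res <- temp[,9]
        n <- length(cond)
        while (any(cond)) {
            indx <- c(FALSE, cond[-1] & !cond[-n])
            res[indx] <- res[indx] + res[which(indx)-1]
            cond[indx] <- FALSE
        }
        temp$`Kumm.` <- res
        return(temp)
    }

    all.equal(dayloop2_E(X), dayloop2_D(X))   # should be TRUE for the simulated X above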
    
Marek