views:

130

answers:

4

I will like to avoid a loop in the following code:

delta_S <- function(Ro, Rr, Ir, S1, S2, S3, S4, chromaty) { .... etc .....}

for (i in 1:nrow(Rrecon)) {
    gri[i, 6] <- delta_S(Ro=as.vector(Rrecon[i, ]), Rr=data_base$bck, Ir=data_base$Ir, S1=data_base$s1, S2=data_base$s2, S3=data_base$s3, S4=data_base$s4, chromaty="tetra")
}

My problem is that in my dataset, i varies from 1 to more than a million. I guess by using something like apply I could save time compared to the 19h currently required.

Thank you for any reply!

Update:

I have read some info about vectorization but it seems largely out of my skills (and I do not even speak about using C ou FORTRAN: I have learned R for three months only, to resolve this study). So by chance, if somebody has some time to look at in details my code and to propose a vectorization if it is possible, it would be hugely appreciated! Thank you Julien PS I copy the entire code because I do not know where it can be vectorize!

data<-read.table("databird.txt",sep="\t",dec=".",header=T,row.names=1)

#perform a PCA
pca<-dudi.pca(data, center = FALSE, scale = FALSE, scannf = FALSE, nf = ncol(data))

#definition of some functions
Q <- function(R,Ir,Si){
  temp <- R/100*Ir*Si
  return(sum((temp[2:length(temp)]+temp[1:(length(temp)-1)])/2))
  }

delta_f <- function(Ro,Rr,Ir,Si) {
  Qo <- Q(Ro,Ir,Si)
  Qr <- Q(Rr,Ir,Si)
  #if(Qo/Qr <0) {print("bug: Qo/Qr < 0, bug log neg"); return(NaN)}
  return(log(Qo/Qr))
  }

delta_S <- function(Ro,Rr,Ir,S1,S2,S3,S4,chromaty){
  if(chromaty=="tetra"){
    e1 <- 0.1
    e2 <- 0.07
    e3 <- 0.07
    e4 <- 0.05
    delta_f1 <- delta_f(Ro,Rr,Ir,S1)
    delta_f2 <- delta_f(Ro,Rr,Ir,S2)
    delta_f3 <- delta_f(Ro,Rr,Ir,S3)
    delta_f4 <- delta_f(Ro,Rr,Ir,S4)
    numerator <- (e1*e2)^2*(delta_f4-delta_f3)^2+(e1*e3)^2*(delta_f4-delta_f2)^2+(e1*e4)^2*(delta_f2-delta_f3)^2+(e2*e3)^2*(delta_f4-delta_f1)^2+(e2*e4)^2*(delta_f3-delta_f1)^2+(e3*e4)^2*(delta_f2-delta_f1)^2
    denominator <- (e1*e2*e3)^2+(e1*e2*e4)^2+(e1*e3*e4)^2+(e2*e3*e4)^2
    return(sqrt(numerator/denominator))
    }
  if(chromaty=="tri"){
    e1 <- 0.0425
    e2 <- 0.02
    e3 <- 0.02
    delta_f1 <- delta_f(Ro,Rr,Ir,S1)
    delta_f2 <- delta_f(Ro,Rr,Ir,S2)
    delta_f3 <- delta_f(Ro,Rr,Ir,S3)
    numerator <- (e1^2*(delta_f3-delta_f2)^2+e2^2*(delta_f3-delta_f1)^2+e3^2*(delta_f1-delta_f2)^2)
    denominator <- ((e1*e2)^2+(e1*e3)^2+(e2*e3)^2)
    return(sqrt(numerator/denominator))
    }
  if(chromaty=="di"){
    e1 <- 0.06
    e2 <- 0.02
    delta_f1 <- delta_f(Ro,Rr,Ir,S1)
    delta_f2 <- delta_f(Ro,Rr,Ir,S2)
    numerator <- (delta_f1-delta_f2)^2
    denominator <- (e1^2+e2^2)
    return(sqrt(numerator/denominator))
    }
  print("specification du niveau de chromaty incorrect")
  return(NaN)
  }

reconstBIS<-function (dudi, nf = 1,scoregrid, ...) 
    {
    if (!inherits(dudi, "dudi")) 
        stop("Object of class 'dudi' expected")
    if (nf > dudi$nf) 
        stop(paste(nf, "factors need >", dudi$nf, "factors available\n"))
    if (!inherits(dudi, "pca")) 
        stop("Object of class 'dudi' expected")
    cent <- dudi$cent
    norm <- dudi$norm
    n <- nrow(scoregrid)
    p <- ncol(dudi$tab)
    res <- matrix(0, n, p)
    for (i in 1:nf) {
        xli <- scoregrid[, i]
        yc1 <- dudi$c1[, i]
        res <- res + matrix(xli, n, 1) %*% matrix(yc1, 1, p)
    }
    res <- t(apply(res, 1, function(x) x * norm))
    res <- t(apply(res, 1, function(x) x + cent))
    res <- data.frame(res)
    names(res) <- names(dudi$tab)
    row.names(res) <- paste(scoregrid[,1],scoregrid[,2],scoregrid[,3],scoregrid[,4],scoregri    [,5],sep="/")
    return(res)
    }

#creation of a grid with values to enter in the function delta_S
ran1<-seq(-410,-400,2)
ran2<-seq(110,120,2)
ran3<-seq(10,20,2)
ran4<-seq(-40,-30,2)
ran5<-seq(10,20,2)
Length<-length(ran1)*length(ran2)*length(ran3)*length(ran4)*length(ran5)
print(Length)
gri <- matrix(NA,ncol=6,nrow=Length)
position <- 1
  for(i in 1:length(ran1)) {
     for(j in 1:length(ran2)){
        for(k in 1:length(ran3)){
          for(l in 1:length(ran4)){
            for(m in 1:length(ran5)){
                gri[position,1] <- ran1[i]
                gri[position,2] <- ran2[j]
                gri[position,3] <- ran3[k]
                gri[position,4] <- ran4[l]
                gri[position,5] <- ran5[m]
                position <- position + 1
                if(position %% 100 == 0) print(paste(round(100*position/Length,2),"%"))
              }
            }
          }
        }
      };gc()

Rrecon<-reconstBIS(pca,5,scoregrid=gri)
data_base <- read.table("data_base.txt",header=T,dec=".",sep="\t") 

And here's the final loop to avoid:

for (i in 1:nrow(Rrecon)) {
gri[i,6]<- delta_S(Ro=as.vector(Rrecon[i,]),Rr=data_base$bck,Ir=data_base$Ir,S1=data_base$s1,S2=data_base$s2,S3=data_base$s3,S4=data_base$s4,chromaty="tetra")
if(i %% 100 == 0) print(paste(round(100*i/nrow(Rrecon),2),"%"))
}

Thank you very much !

+4  A: 

Changing loop to apply don't speed up your code.
You should vectorize delta_S function or use parallel processing (if you have multi-core processor).

Marek
I agree, try to vectorize your function!eh, and try to indent your code by using four spaces / return.
ran2
+3  A: 

There are a few SO posts you might want to read on loops/vectorisation in R.

I agree with Marek - try vectorising your function. Another option is to rewrite the time-consuming parts in C or FORTRAN, then load them as shared objects.

nullglob
+2  A: 

You want more speed...

The suggestions to vectorize your function Delta_S are all well intentioned and would be great if it could be done. I'm not sure it can. Nevertheless, it's a little hard for me to see. It seems to me you need to combine columns of a data frame and rows of a matrix in your final outcome. This is going to be time consuming unless you can solve the row or column problems first. I'll get to that in a minute...

Your creation of your gri only requires you to enter (after the ran variables)

gri <- expand.grid(ran1,ran2,ran3,ran4,ran5)
gri[,6] <- NA
gri <- as.matrix(gri)

That's a lot of lines of code removed right there.

You have several vectors that are essentially constants in your code. They are taken from the database but used as vectors repeatedly (data_base$bck, data_base$Ir, data_base$s1, etc). Each needs to be solved once for that entire for loop. The Qr variable only needs to be solved once for all of Rrecon, not for each line. The denominator only needs to be solved once for all of Rrecon, not for each line... etc. Break the problem down doing all of that first. Then apply to your rows of Rrecon.

While it is sometimes true that apply doesn't save you much over a for loop there are various apply family commands that are very much faster than others. And they almost all, almost always, do save some time. Some of them save loads of time. You'll also be surprised to discover that applying small functions in a vector like syntax (thus implying many small for loops) is faster than applying a large function in a C like syntax.

Oh, and the short answer to getting rid of your final for loop (one of many that should be removed) is...

gri[,6]<- apply (Rrecon, 1, function(x){
    delta_S(Ro=as.vector(x)
    ,Rr=data_base$bck, Ir=data_base$Ir, S1=data_base$s1
    ,S2=data_base$s2, S3=data_base$s3, S4=data_base$s4
    ,chromaty="tetra")
    })

That may not get you a big speedup by itself. It would be much faster if you were just passing the numerator and denominator. But that would require separate apply family loops beforehand to solve each little sub calculations in your code (solve delta_f values, then solve numerator, etc).

You might want to also read the RInferno

John
A: 

You have a very simple calculation performed many, many times. I don't think this is vectorizable (although maybe if you posted the source formula, someone could do it in ten lines, but it's really hard to reverse engineer from your 100).

Two general suggestions, that I think are good habits:

(1) for every iteration of this loop, which occurs 10^6 times, you're storing a whole set of constants (the e_, ran_, Length, and gri). Try calculating those outside the loop.

(2) you also do a lot of a <- 2*b, c <- 3*d, e <- a/c calculations. Replace these three equations with one e <- 2*b / 3*d. You will be surprised how much time it saves when done 10^6.

There's a lot of good stuff in there that I'm sure helped prototyping and readability, but if you're doing 10^6 iterations, you really need to trim it down.

richardh