Hi,

I'm working with R and I have code like this:

for (i in 1:10)
   for (j in 1:100)
        if (data[i] == paths[j,1])
            cluster[i,4] <- paths[j,2]

where:

  • data is a vector with 100 rows and 1 column
  • paths is a matrix with 100 rows and 5 columns
  • cluster is a matrix with 100 rows and 5 columns

My question is: how could I avoid the use of "for" loops to iterate through the matrix? I don't know whether apply functions (lapply, tapply...) are useful in this case.

This is a problem when j=10000, for example, because the execution time becomes very long.

Thank you

+1  A: 

The inner loop can be vectorized:

cluster[i,4] <- paths[max(which(data[i]==paths[,1])),2]

but check Musa's comment. I think you intended something else.
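Here is a small sketch of that idea on made-up toy data (the values below are hypothetical, not from the question). Note that `max()` of an empty vector returns `-Inf` with a warning, so the sketch adds a guard for the case where `data[i]` has no match in `paths[,1]`:

```r
# Toy data (hypothetical values, only for illustration)
data    <- c("a", "b", "z")
paths   <- cbind(c("a", "b", "a"), c("p1", "p2", "p3"))
cluster <- matrix(NA_character_, nrow = 3, ncol = 5)

for (i in seq_along(data)) {
  hits <- which(data[i] == paths[, 1])    # all rows of paths whose key matches
  if (length(hits) > 0) {                 # skip when nothing matches
    cluster[i, 4] <- paths[max(hits), 2]  # last match wins, as in the inner loop
  }
}
cluster[, 4]
```

Taking `max(hits)` mirrors the original inner loop, where a later matching `j` overwrites an earlier one.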

The second (outer) loop could be vectorized too, by replicating vectors, but

  1. if i only goes up to 100, the speed-up won't be large
  2. it will need more RAM
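One way this could look, as a rough sketch on hypothetical toy data: `outer()` builds the full comparison matrix (this is where the extra RAM goes, since it has `length(data) * nrow(paths)` cells), and `max.col()` with `ties.method = "last"` picks the last matching row of `paths` for each element of `data`.

```r
# Toy data (hypothetical values, only for illustration)
data    <- c(3, 7, 9)
paths   <- cbind(c(3, 7, 3, 5), c(10, 20, 30, 40))
cluster <- matrix(0, nrow = 3, ncol = 5)

eq   <- outer(data, paths[, 1], "==")          # length(data) x nrow(paths) logical matrix
hit  <- rowSums(eq) > 0                        # which data[i] have at least one match
last <- max.col(eq + 0, ties.method = "last")  # column of the last TRUE in each row
cluster[hit, 4] <- paths[last[hit], 2]         # rows without a match are left untouched
cluster[, 4]
```

For rows with no match `max.col()` still returns a column index, which is why the result is filtered with `hit` before assigning.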

[edit] If I understood your comment correctly, can you just use logical indexing?

indx <- data==paths[, 1]
cluster[indx, 4] <- paths[indx, 2]
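Keep in mind that `==` here compares position by position, so this only does what you want if `data` and `paths[, 1]` line up row by row. A tiny sketch with hypothetical values shows the behaviour:

```r
# Toy data (hypothetical values, only for illustration)
data    <- c("a", "b", "c")
paths   <- cbind(c("b", "b", "a"), c("p1", "p2", "p3"))
cluster <- matrix(NA_character_, nrow = 3, ncol = 5)

indx <- data == paths[, 1]          # element-wise: only position 2 is equal
cluster[indx, 4] <- paths[indx, 2]  # copies paths[2, 2] into cluster[2, 4]
cluster[, 4]
```

Here `data[1]` ("a") does occur in `paths[3, 1]`, but it is not picked up because the comparison never looks across rows, unlike the double loop.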
Marek
+1  A: 

I think that both loops can be vectorized using the following:

cluster[na.omit(match(paths[1:100,1],data[1:10])),4] = paths[!is.na(match(paths[1:100,1],data[1:10])),2]
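A quick way to check this against the loop is to run both on random test data. The sketch below is only an illustration under my own assumptions: the keys are unique (no repeated values in `data` or `paths[,1]`), both objects have 100 rows, and the names `loop_sol` and `vect_sol` are just labels for the two results.

```r
set.seed(1)                        # arbitrary seed, only for reproducibility
n     <- 100
data  <- sample(1000, n)           # unique keys (sampled without replacement)
paths <- cbind(sample(data), seq_len(n), 0, 0, 0)  # same keys in shuffled order

# Original double loop
loop_sol <- matrix(0, n, 5)
for (i in seq_len(n))
  for (j in seq_len(n))
    if (data[i] == paths[j, 1])
      loop_sol[i, 4] <- paths[j, 2]

# Vectorized version, calling match() once instead of twice
vect_sol <- matrix(0, n, 5)
m <- match(paths[, 1], data)       # position in data of each paths key, NA if absent
vect_sol[na.omit(m), 4] <- paths[!is.na(m), 2]

all.equal(loop_sol, vect_sol)
```

Wrapping each part in `system.time()` gives a rough idea of the speed difference as the number of rows grows. With repeated keys the two results are not guaranteed to agree.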
gd047
I wonder how the performance of your vectorized solution compares to the looping alternative.
Guido
@Guido In this particular case it's hard to say, because the results from the original loop and gd047's solution differ, but in general the difference between a loop and vectorized code can be huge. Check my answer to http://stackoverflow.com/questions/2908822/speed-up-the-loop-operation-in-r, where you can go from hours to less than a second.
Marek
@Marek Using randomized test matrices I got equal cluster matrices from both methods. I checked the results using `all.equal(loop_sol,vect_sol)`. Which test matrices did you use that gave you different results?
gd047
@gd047 Check this http://sites.google.com/site/fsh9rss8heh/ (too long for a comment). I use R-2.10.1.
Marek
@Marek Thanks. You are right. In my examples there was never more than one match between data[i] and paths[j,1]. In the general case, where there is more than one, the loop keeps the match that is checked last. I am not sure which one wins in the vectorized version. Do you have any idea?
gd047
@gd047 As stated in `help("match")`, `match` returns **positions of (first) matches**, so you could write your own version using `rev`: `match_last <- function(x,y) length(y)-match(x,rev(y))+1`
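A sketch of how `match_last` might be used to mimic the double loop, on hypothetical toy data. Note that this looks up each `data[i]` in `paths[,1]` (the arguments are the other way round compared with the `match` call in the answer), which is my assumption about what the loop is meant to do:

```r
match_last <- function(x, y) length(y) - match(x, rev(y)) + 1

# Toy data (hypothetical values) with repeated keys on both sides
data    <- c("a", "b", "a", "z")
paths   <- cbind(c("a", "b", "a"), c("p1", "p2", "p3"))
cluster <- matrix(NA_character_, nrow = 4, ncol = 5)

m  <- match_last(data, paths[, 1])  # last row of paths matching each data[i], NA if none
ok <- !is.na(m)
cluster[ok, 4] <- paths[m[ok], 2]   # unmatched rows keep their old value
cluster[, 4]
```

Both occurrences of "a" in `data` get the value from the last matching row of `paths`, which is what the loop would produce for this input.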
Marek
@Marek Nice idea but there's still a difference.
gd047