views:

90

answers:

3

Hi,

I'm stuck with a simple loop that takes more than an hour to run, and need help to speed it up.

Basically, I have a matrix with 31 columns and 400 000 rows. The first 30 columns have values, and the 31st column has a column-number. I need to, per row, retrieve the value in the column indicated by the 31st column.

Example row: [26,354,72,5987..,461,3] (this means that the value in column 3 is sought after (72))

The too slow loop looks like this:

a <- rep(0,nrow(data)) #To pre-allocate memory
for (i in 1:nrow(data)) {
   a[i] <- data[i,data[i,31]]
}

I would think this would work:

a <- data[,data[,31]]

... but it results in "Error: cannot allocate vector of size 2.8 Mb".

I fear that this is a really simple question, so I've spent hours trying to understand apply, lapply, reshape, and more, but somehow I can't get a grip on the vectorization concept in R.

The matrix actually has even more columns that also go into the a-parameter, which is why I don't want to rebuild the matrix, or split it.

Your support is highly appreciated!

Chris

+2  A: 
t(data[,1:30])[30*(0:399999)+data[,31]]

This works because you can reference matricies both in array format, and vector format (a 400000*31 long vector in this case) counting column-wise first. To count row-wise, you use the transpose.

James
A: 

Singe-index notation for the matrix may use less memory. This would involve doing something like:

i <- nrow(data)*(data[,31]-1) + 1:nrow(data)
a <- data[i]

Below is an example of single-index notation for matrices in R. In this example, the index of the per-row maximum is appended as the last column of a random matrix. This last column is then used to select the per-row maxima via single-index notation.

## create a random (10 x 5) matrix                                                                                                                           
M <- matrix(rpois(50,50),10,5)
## use the last column to index the maximum value of the first 5                                                                                             
## columns                                                                                                                                                   
MM <- cbind(M,apply(M,1,which.max))
##             column ID          row ID                                                                                                                     
i <- nrow(MM)*(MM[,ncol(MM)]-1) + 1:nrow(MM)
all(MM[i] == apply(M,1,max))

Using an index matrix is an alternative that will probably use more memory but is slightly clearer:

ii <- cbind(1:nrow(MM),MM[,ncol(MM)])
all(MM[ii] == apply(M,1,max))
nullglob
A: 

Try to change the code to work a column at a time:

M <- matrix(rpois(30*400000,50),400000,30)
MM <- cbind(M,apply(M,1,which.max))
a <- rep(0,nrow(MM))
for (i in 1:(ncol(MM)-1)) {
    a[MM[, ncol(MM)] == i] <- MM[MM[, ncol(MM)] == i, i]
}

This sets all elements in a with the values from column i if the last column has value i. It took longer to build the matrix than to calculate vector a.

Henrico