tags:

views:

493

answers:

4

This question came today in the manipulatr mailing list.

http://groups.google.com/group/manipulatr/browse_thread/thread/fbab76945f7cba3f

I am rephrasing.

Given a distance matrix (calculated with dist) apply a function to the rows of the distance matrix.

Code:

library(plyr)
N <- 100
a <- data.frame(b=1:N,c=runif(N))
d <- dist(a,diag=T,upper=T)
sumd <- adply(as.matrix(d),1,sum)

The problem is that to apply the function by row you have to store the whole matrix (instead of just the lower triangular part. So it uses too much memory for large matrices. It fails in my computer for matrices of dimensions ~ 10000.

Any ideas?

+2  A: 

My solution is to get the indexes of the distance vector, given a row and the size of the matrix. I got this from codeguru

int Trag_noeq(int row, int col, int N)
{
   //assert(row != col);    //You can add this in if you like
   if (row<col)
      return row*(N-1) - (row-1)*((row-1) + 1)/2 + col - row - 1;
   else if (col<row)
      return col*(N-1) - (col-1)*((col-1) + 1)/2 + row - col - 1;
   else
      return -1;
}

After translating to R, assuming indexes start at 1, and assuming a lower tri instead of upper tri matrix I got.
EDIT: Using the vectorized version contributed by rcs

noeq.1 <- function(i, j, N) {
    i <- i-1
    j <- j-1
    ix <- ifelse(i < j,
                 i*(N-1) - (i-1)*((i-1) + 1)/2 + j - i,
                 j*(N-1) - (j-1)*((j-1) + 1)/2 + i - j) * ifelse(i == j, 0, 1)
    ix
}

## To get the indexes of the row, the following one liner works:

getrow <- function(z, N) noeq.1(z, 1:N, N)

## to get the row sums

getsum <- function(d, f=sum) {
    N <- attr(d, "Size")
    sapply(1:N, function(i) {
        if (i%%100==0) print(i)
        f(d[getrow(i,N)])
    })
}

So, with the example:

sumd2 <- getsum(d)

This was much slower than as.matrix for small matrices before vectorizing. But just about 3x as slow after vectorizing. In a Intel Core2Duo 2ghz applying the sum by row of the size 10000 matrix took just over 100s. The as.matrix method fails. Thanks rcs!

Eduardo Leoni
+2  A: 

This is a vectorized version of the function noeq (either argument i or j):

noeq.1 <- function(i, j, N) {
    i <- i-1
    j <- j-1
    ifelse(i < j,
           i*(N-1) - ((i-1)*i)/2 + j - i,
           j*(N-1) - ((j-1)*j)/2 + i - j) * ifelse(i == j, 0, 1)
}   

> N <- 4
> sapply(1:N, function(i) sapply(1:N, function(j) noeq(i, j, N)))
     [,1] [,2] [,3] [,4]
[1,]    0    1    2    3
[2,]    1    0    4    5
[3,]    2    4    0    6
[4,]    3    5    6    0
> sapply(1:N, function(i) noeq.1(i, 1:N, N))
     [,1] [,2] [,3] [,4]
[1,]    0    1    2    3
[2,]    1    0    4    5
[3,]    2    4    0    6
[4,]    3    5    6    0

Timings are done on a 2.4 GHz Intel Core 2 Duo (Mac OS 10.6.1):

> N <- 1000
> system.time(sapply(1:N, function(j) noeq.1(1:N, j, N)))
   user  system elapsed 
  0.676   0.061   0.738 
> system.time(sapply(1:N, function(i) sapply(1:N, function(j) noeq(i, j, N))))
   user  system elapsed 
 14.359   0.032  14.410
rcs
Good example of how R can be fast: 20x improvement!
Eduardo Leoni
+3  A: 

First of all, for anyone who hasn't seen this yet, I strongly recommend reading this article on the r-wiki about code optimization.

Here's another version without using ifelse (that's a relatively slow function):

noeq.2 <- function(i, j, N) {
    i <- i-1
    j <- j-1
    x <- i*(N-1) - (i-1)*((i-1) + 1)/2 + j - i
    x2 <- j*(N-1) - (j-1)*((j-1) + 1)/2 + i - j
    idx <- i < j
    x[!idx] <- x2[!idx]
    x[i==j] <- 0
    x
}

And timings on my laptop:

> N <- 1000
> system.time(sapply(1:N, function(i) sapply(1:N, function(j) noeq(i, j, N))))
   user  system elapsed 
  51.31    0.10   52.06 
> system.time(sapply(1:N, function(j) noeq.1(1:N, j, N)))
   user  system elapsed 
   2.47    0.02    2.67 
> system.time(sapply(1:N, function(j) noeq.2(1:N, j, N)))
   user  system elapsed 
   0.88    0.01    1.12

And lapply is faster than sapply:

> system.time(do.call("rbind",lapply(1:N, function(j) noeq.2(1:N, j, N))))
   user  system elapsed 
   0.67    0.00    0.67
Shane
Hi Shane, only to mention that your link is broken, here is the correct one - http://rwiki.sciviews.org/doku.php?id=tips:programming:code_optim2
Tal Galili
A: 

Oh! It's very usefull and fantastic .

Thank you alot .

Khaled
This is a comment, not an answer.
Shane