Hello, I am trying to loop over a matrix, compute the correlation coefficient between each pair of rows, and print out the correlation matrix.

ID A B C D E F G H I
Row01 0.08 0.47 0.94 0.33 0.08 0.93 0.72 0.51 0.55
Row02 0.37 0.87 0.72 0.96 0.20 0.55 0.35 0.73 0.44
Row03 0.19 0.71 0.52 0.73 0.03 0.18 0.13 0.13 0.30
Row04 0.08 0.77 0.89 0.12 0.39 0.18 0.74 0.61 0.57
Row05 0.09 0.60 0.73 0.65 0.43 0.21 0.27 0.52 0.60
Row06 0.60 0.54 0.70 0.56 0.49 0.94 0.23 0.80 0.63
Row07 0.02 0.33 0.05 0.90 0.48 0.47 0.51 0.36 0.26
Row08 0.34 0.96 0.37 0.06 0.20 0.14 0.84 0.28 0.47
........
(30000 rows!)

I want the Pearson correlation output as:

 Row01
Row01 1.000
Row02 0.012
Row03 0.023
Row04 0.820
Row05 0.165
Row06 0.230
Row07 0.376
Row08 0.870

output as Row01.txt

Row02
Row01 0.012
Row02 1.000
Row03 0.023
Row04 0.820
Row05 0.165
Row06 0.230
Row07 0.376
Row08 0.870

output as Row02.txt, and so on.

There will be 30000 output files!

I am aware that this algorithm looks stupid: matrix <- cor(t(data)) would do the whole thing, and half of the correlation matrix is enough because the result is symmetric about the diagonal.

But my problems are

  1. My data is too big for R to handle (30000x30000).
  2. It is hard to retrieve the correlations of one specific row with all the others.
  3. Using my "stupid algorithm" I can easily get the correlations I am interested in from the folder.
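To put a number on point 1, here is a rough estimate, assuming the limit refers to the dense 30000x30000 correlation matrix stored as 8-byte doubles (the variable names are only for illustration):

```r
# Rough size of a dense n x n correlation matrix of doubles (8 bytes each).
n <- 30000
full_bytes <- n^2 * 8
full_gib   <- full_bytes / 1024^3        # about 6.7 GiB

# One row of correlations at a time needs only n doubles.
row_bytes <- n * 8
row_kib   <- row_bytes / 1024            # about 234 KiB

cat(sprintf("full matrix: %.1f GiB, one row: %.0f KiB\n", full_gib, row_kib))
```

So holding one row of correlations at a time, as in the per-file scheme, is tiny compared with the full matrix.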

Thanks in advance!

Ivan

A: 

Not tested, but something like this should work I guess

EDIT: corrected code to avoid huge matrix

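correl <- NULL
datamatrix <- as.matrix(datamatrix)  # make sure rows are plain numeric vectors
for (i in 1:nrow(datamatrix))
    {
    # correlate row i with every row of the matrix
    correl <- apply(datamatrix, 1, function(x){cor(datamatrix[i,], x)})
    write.table(correl, paste("row", i, ".txt", sep=""))
    }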
nico
Hm I fear that doesn't fly. Original Poster claimed `datamatrix` was too big for memory.
Dirk Eddelbuettel
@Dirk Eddelbuettel: Hmmm, that's true. I assumed he was talking about the output matrix, but the input matrix is huge too... I didn't think about that. Wasn't there a package to handle huge matrices in memory, or am I wrong?
nico
Thanks! I had a problem with the SUSE machine where I want to run this. I will try the code and get back soon.
Ivan
A: 

Thanks Nico! Almost got there after I corrected small bugs. Here I attach my script:

datamatrix=read.table("ref.txt",sep="\t",header=T,row.names=1)
correl <- NULL
for (i in 1:nrow(datamatrix))
    {
    correl <- apply(datamatrix, 1, function(x){cor(t(datamatrix[,i]))})
    write.table(correl, paste(row.names(datamatrix)[i], ".txt", sep=""))
    }

But I am afraid the function(x) part is a problem; it seems it should be t(datamatrix[i,j]), which would calculate the correlation of any two rows.
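One way the callback could be written, as an untested-on-the-real-data sketch: fix row i and correlate it with each row x that apply() passes in. The matrix m below holds the eight sample rows from the question; the check against cor(t(m)) is only possible because this example is tiny.

```r
# First eight rows from the question, typed in as a numeric matrix.
m <- matrix(c(
  0.08, 0.47, 0.94, 0.33, 0.08, 0.93, 0.72, 0.51, 0.55,
  0.37, 0.87, 0.72, 0.96, 0.20, 0.55, 0.35, 0.73, 0.44,
  0.19, 0.71, 0.52, 0.73, 0.03, 0.18, 0.13, 0.13, 0.30,
  0.08, 0.77, 0.89, 0.12, 0.39, 0.18, 0.74, 0.61, 0.57,
  0.09, 0.60, 0.73, 0.65, 0.43, 0.21, 0.27, 0.52, 0.60,
  0.60, 0.54, 0.70, 0.56, 0.49, 0.94, 0.23, 0.80, 0.63,
  0.02, 0.33, 0.05, 0.90, 0.48, 0.47, 0.51, 0.36, 0.26,
  0.34, 0.96, 0.37, 0.06, 0.20, 0.14, 0.84, 0.28, 0.47),
  nrow = 8, byrow = TRUE,
  dimnames = list(sprintf("Row%02d", 1:8), LETTERS[1:9]))

i <- 1
# Correlate the fixed row i with every row x in turn.
correl <- apply(m, 1, function(x) cor(m[i, ], x))

# Row i of the full correlation matrix holds the same numbers.
full <- cor(t(m))
print(round(correl, 3))
```

Note that m must be a numeric matrix here; with a data frame from read.table, as.matrix() would be needed first so that m[i, ] is a plain vector.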

Actually I need to iterate through the matrix: first cor(row01, row02) to get the correlation between row01 and row02; then cor(row01, row03) to get the correlation between row01 and row03; and so on up to the correlation between row01 and row30000. That gives the first column, for row01:

Row01
Row01 1.000
Row02 0.012
Row03 0.023
Row04 0.820
Row05 0.165
Row06 0.230
Row07 0.376
Row08 0.870

which I save to the file row01.txt.

Similarly I get

Row02
Row01 0.012
Row02 1.000
Row03 0.023
Row04 0.820
Row05 0.165
Row06 0.230
Row07 0.376
Row08 0.870

and save it to the file row02.txt.

In total I will get 30000 files. It is stupid, but it sidesteps the memory limit, and the correlations for a specific row are easy to look up.

Ivan
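For what it's worth, the one-file-per-row scheme described above could be sketched as follows. This is only a sketch under assumptions: write_row_correlations and the output folder name are made up for illustration, the input is assumed to fit in memory as a numeric matrix with row names, and the demo data are random numbers rather than the real file. It relies on cor(x, y) accepting a matrix x and a vector y, so each pass produces a single column of correlations instead of the full matrix.

```r
# Write one correlation file per row without building the n x n matrix.
# `m` must be a numeric matrix with row names; `out_dir` is a hypothetical
# output folder name.
write_row_correlations <- function(m, out_dir = "correlations") {
  dir.create(out_dir, showWarnings = FALSE)
  tm <- t(m)                      # columns of tm are the rows of m
  for (i in seq_len(nrow(m))) {
    # cor(tm, tm[, i]) correlates every column of tm with column i,
    # giving an n x 1 matrix: row i against all rows of m.
    correl <- cor(tm, tm[, i])
    colnames(correl) <- rownames(m)[i]
    write.table(round(correl, 3),
                file = file.path(out_dir, paste0(rownames(m)[i], ".txt")),
                sep = "\t", quote = FALSE, col.names = NA)
  }
  invisible(out_dir)
}

# Small stand-in for the real data (random numbers, not from the question).
set.seed(1)
demo <- matrix(runif(5 * 9), nrow = 5,
               dimnames = list(sprintf("Row%02d", 1:5), LETTERS[1:9]))
out <- write_row_correlations(demo, file.path(tempdir(), "correlations"))
```

With the real data, the matrix would presumably come from as.matrix(read.table("ref.txt", sep="\t", header=TRUE, row.names=1)) as in the script above. Each file then has the row name as its header and one correlation per line.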