Combining 2 columns into 1 column many times in a very large dataset in R

The clumsy solutions I am working on are not going to be very fast, even if I can get them to work, and the true dataset is ~1500 x 45000, so they need to be fast. I am definitely at a loss for step 1) below, although I have some code for 2) and 3).

Here is a toy example of the data structure:

n = 10  # number of individuals; must match the length of the SNP vectors below
pop = data.frame(status = rbinom(n, 1, .42), sex = rbinom(n, 1, .5),
age = round(rnorm(n, mean=40, 10)), disType = rbinom(n, 1, .2),
rs123=c(1,3,1,3,3,1,1,1,3,1), rs123.1=rep(1, n), rs157=c(2,4,2,2,2,4,4,4,2,2),
rs157.1=c(4,4,4,2,4,4,4,4,2,2),  rs132=c(4,4,4,4,4,4,4,4,2,2),
rs132.1=c(4,4,4,4,4,4,4,4,4,4))

Thus, there are a few columns of basic demographic info, and the rest of the columns are biallelic SNP info. For example, rs123 holds the first allele of SNP rs123 and rs123.1 holds the second allele.

1) I need to merge all the biallelic SNP data that is currently in 2 columns into 1 column, so, for example: rs123 and rs123.1 into one column (but within the dataset):

11
31
11
31
31
11
11
11
31
11

2) I need to identify the least frequent SNP value (in the above example it is 31).

3) I need to replace the least frequent SNP value with 1 and the other(s) with 0.

+1  A: 

Do you mean 'merge' or 'rearrange' or simply concatenate? If it is the latter, then:

R> pop2 <- data.frame(pop[,1:4], rs123=paste(pop[,5],pop[,6],sep=""), 
+                                rs157=paste(pop[,7],pop[,8],sep=""), 
+                                rs132=paste(pop[,9],pop[,10], sep=""))
R> pop2
   status sex age disType rs123 rs157 rs132
1       0   0  42       0    11    24    44
2       1   1  37       0    31    44    44
3       1   0  38       0    11    24    44
4       0   1  45       0    31    22    44
5       1   1  25       0    31    24    44
6       0   1  31       0    11    44    44
7       1   0  43       0    11    44    44
8       0   0  41       0    11    44    44
9       1   1  57       0    31    22    24
10      1   1  40       0    11    22    24

and now you can do counts and whatnot on pop2:

R> sapply(pop2[,5:7], table)
$rs123

11 31 
 6  4 

$rs157

22 24 44 
 3  3  4 

$rs132

24 44 
 2  8 

R> 
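For steps 2) and 3) of the question, the counts from `table()` can be turned into a 0/1 indicator. The following is only a minimal sketch: `rare_indicator` is a hypothetical helper name, not something from the question or from base R, and it assumes that ties are broken by taking the first value in sorted order (rs157 above has such a tie between 22 and 24).

```r
# Toy genotype column, same values as rs123 in pop2 above
geno <- c("11", "31", "11", "31", "31", "11", "11", "11", "31", "11")

# Hypothetical helper: 1 for the least frequent value, 0 for all others
rare_indicator <- function(x) {
  # as.character() avoids counting unused factor levels as zero
  counts <- table(as.character(x))
  # which.min() returns the first minimum, so ties go to the first sorted value
  rarest <- names(counts)[which.min(counts)]
  as.integer(x == rarest)
}

rare_indicator(geno)
# [1] 0 1 0 1 1 0 0 0 1 0
```

It could then be applied column by column, for instance with `lapply(pop2[, 5:7], rare_indicator)`.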
Dirk Eddelbuettel
paste! Of course! I did mean concatenate. Thanks so much for the help. Now I'm working on making it work across 45,000 columns. Thanks again!
S.R.
You can work with `grep()` and `match()` to get your column indices. Also, feel free to upvote and/or accept this answer if it strikes you as the right one :-)
Dirk Eddelbuettel
accepted! :) I don't have enough reputation points apparently to upvote yet...!
S.R.
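The `grep()` and `match()` suggestion from the comments could be sketched roughly as follows, so that the allele pairs do not have to be typed out one by one. This assumes every second-allele column is named like its first-allele column plus a `.1` suffix, as in the toy data; the small `pop` below is a cut-down stand-in for illustration, and `first`, `second` and `merged` are names made up for this example.

```r
# Small stand-in for the real data: 2 demographic columns + 2 SNPs
pop <- data.frame(status = c(0, 1, 1, 0), sex = c(0, 1, 0, 1),
                  rs123 = c(1, 3, 1, 3), rs123.1 = c(1, 1, 1, 1),
                  rs157 = c(2, 4, 2, 2), rs157.1 = c(4, 4, 4, 2))

# Second-allele columns are assumed to end in ".1"
second <- grep("\\.1$", names(pop))
# Matching first-allele columns: the same name without the suffix
first <- match(sub("\\.1$", "", names(pop)[second]), names(pop))

# Paste each allele pair; mapply() walks both index vectors in parallel
# and returns a character matrix with one column per SNP
merged <- mapply(function(i, j) paste(pop[[i]], pop[[j]], sep = ""),
                 first, second)
colnames(merged) <- names(pop)[first]

# Demographic columns first, merged SNP columns after
pop2 <- data.frame(pop[-c(first, second)], merged)
```

Whether this is fast enough for ~1500 x 45000 would have to be measured on the real data; it makes one `paste()` call per SNP rather than one per cell.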