ansaurus

Question

R: How to calculate correlation - cor() - for only a subset of columns?

Answer 1

+8 A:

if you have a dataframe where some columns are numeric and some are other (character or factor) and you only want to do the correlations for the numeric columns, you could do the following:

set.seed(10)

x = as.data.frame(matrix(rnorm(100), ncol = 10))
x$L1 = letters[1:10]
x$L2 = letters[11:20]

cor(x)

Error in cor(x) : 'x' must be numeric

but

cor(x[sapply(x, is.numeric)])

             V1         V2          V3          V4          V5          V6          V7
V1   1.00000000  0.3025766 -0.22473884 -0.72468776  0.18890578  0.14466161  0.05325308
V2   0.30257657  1.0000000 -0.27871430 -0.29075170  0.16095258  0.10538468 -0.15008158
V3  -0.22473884 -0.2787143  1.00000000 -0.22644156  0.07276013 -0.35725182 -0.05859479
V4  -0.72468776 -0.2907517 -0.22644156  1.00000000 -0.19305921  0.16948333 -0.01025698
V5   0.18890578  0.1609526  0.07276013 -0.19305921  1.00000000  0.07339531 -0.31837954
V6   0.14466161  0.1053847 -0.35725182  0.16948333  0.07339531  1.00000000  0.02514081
V7   0.05325308 -0.1500816 -0.05859479 -0.01025698 -0.31837954  0.02514081  1.00000000
V8   0.44705527  0.1698571  0.39970105 -0.42461411  0.63951574  0.23065830 -0.28967977
V9   0.21006372 -0.4418132 -0.18623823 -0.25272860  0.15921890  0.36182579 -0.18437981
V10  0.02326108  0.4618036 -0.25205899 -0.05117037  0.02408278  0.47630138 -0.38592733
              V8           V9         V10
V1   0.447055266  0.210063724  0.02326108
V2   0.169857120 -0.441813231  0.46180357
V3   0.399701054 -0.186238233 -0.25205899
V4  -0.424614107 -0.252728595 -0.05117037
V5   0.639515737  0.159218895  0.02408278
V6   0.230658298  0.361825786  0.47630138
V7  -0.289679766 -0.184379813 -0.38592733
V8   1.000000000  0.001023392  0.11436143
V9   0.001023392  1.000000000  0.15301699
V10  0.114361431  0.153016985  1.00000000

Greg 2010-08-26 04:44:50

if you really only want to do the correlation on columns 1, 3, and 10, you could always do `cor(x[c(1, 3, 10)])`

Greg 2010-08-26 04:45:37

Sorry, this is for numeric, not non-numeric data. I'll leave it just in case.

Greg 2010-08-26 04:46:36

glad you left it, Greg. You already helped someone – it already helped me to se sapply in another creative way :)

ran2 2010-08-26 07:51:22

Answer 2

+3 A:

For numerical data you have the solution. But it is categorical data, you said. Then life gets a bit more complicated...

Well, first : The amount of association between two categorical variables is not measured with a Spearman rank correlation, but with a Chi-square test for example. Which is logic actually. Ranking means there is some order in your data. Now tell me which is larger, yellow or red? I know, sometimes R does perform a spearman rank correlation on categorical data. If I code yellow 1 and red 2, R would consider red larger than yellow.

So, forget about Spearman for categorical data. I'll demonstrate the chisq-test and how to choose columns using combn(). But you would benefit from a bit more time with Agresti's book : http://www.amazon.com/Categorical-Analysis-Wiley-Probability-Statistics/dp/0471360937

set.seed(1234)
X <- rep(c("A","B"),20)
Y <- sample(c("C","D"),40,replace=T)

table(X,Y)
chisq.test(table(X,Y),correct=F)
# I don't use Yates continuity correction

#Let's make a matrix with tons of columns

Data <- as.data.frame(
          matrix(
            sample(letters[1:3],2000,replace=T),
            ncol=25
          )
        )

# You want to select which columns to use
columns <- c(3,7,11,24)
vars <- names(Data)[columns]

# say you need to know which ones are associated with each other.
out <-  apply( combn(columns,2),2,function(x){
          chisq.test(table(Data[,x[1]],Data[,x[2]]),correct=F)$p.value
        })

out <- cbind(as.data.frame(t(combn(vars,2))),out)

Then you should get :

> out
   V1  V2       out
1  V3  V7 0.8116733
2  V3 V11 0.1096903
3  V3 V24 0.1653670
4  V7 V11 0.3629871
5  V7 V24 0.4947797
6 V11 V24 0.7259321

Where V1 and V2 indicate between which variables it goes, and "out" gives the p-value for association. Here all variables are independent. Which you would expect, as I created the data at random.

Joris Meys 2010-08-26 08:17:45

Sorry, I have the tendency to nest functions quite often to avoid too many idle variables in my workspace. If you can't make sense of the code, just ask and I'll explain what it does.

Joris Meys 2010-08-26 08:20:04

@Joris: thanks. I actually forgot to mention in the question that the data is categorical but ranked (the level of approval with something). You get a vote nonetheless for the code (from which I'll learn stuff anyway) and for the book reference.

wishihadabettername 2010-08-26 14:09:47

ah, OK. That explains :-) Sorry for the lecture then, no harm meant. I can definitely recommend Agresti anyway. It's the standard when it comes down to categorical data analysis.

Joris Meys 2010-08-26 14:12:47

Answer 3

A:

I found an easier way by looking at the R script generated by Rattle. It looks like below:

correlations <- cor(mydata[,c(1,3,5:87,89:90,94:98)], use="pairwise", method="spearman")

wishihadabettername 2010-08-27 15:40:27

This is almost exactly what [Greg wrote in a comment for his answer](http://stackoverflow.com/questions/3571909/r-how-to-calculate-correlation-cor-for-only-a-subset-of-columns/3572115#3572115).

Marek 2010-08-27 16:22:18

Ah, OK, I got sidetracked by the use of sapply().

wishihadabettername 2010-08-27 19:47:18

ansaurus

tags:

views:

answers:

R: How to calculate correlation - cor() - for only a subset of columns?

related questions