tags:

views:

189

answers:

3

I have a dataframe and would like to calculate the correlation (with Spearman, data is categorical and ranked) but only for a subset of columns. I tried with all, but R's cor() function only accepts numerical data (x must be numeric, says the error message), even if Spearman is used.

One brute approach is to delete the non-numerical columns from the dataframe. This is not as elegant, for speed I still don't want to calculate correlations between all columns.

I hope there is a way to simply say "calculate correlations for columns x, y, z". Column references could by number or by name. I suppose the flexible way to provide them would be through a vector.

Any suggestions are appreciated.

+8  A: 

if you have a dataframe where some columns are numeric and some are other (character or factor) and you only want to do the correlations for the numeric columns, you could do the following:

set.seed(10)

x = as.data.frame(matrix(rnorm(100), ncol = 10))
x$L1 = letters[1:10]
x$L2 = letters[11:20]

cor(x)

Error in cor(x) : 'x' must be numeric

but

cor(x[sapply(x, is.numeric)])

             V1         V2          V3          V4          V5          V6          V7
V1   1.00000000  0.3025766 -0.22473884 -0.72468776  0.18890578  0.14466161  0.05325308
V2   0.30257657  1.0000000 -0.27871430 -0.29075170  0.16095258  0.10538468 -0.15008158
V3  -0.22473884 -0.2787143  1.00000000 -0.22644156  0.07276013 -0.35725182 -0.05859479
V4  -0.72468776 -0.2907517 -0.22644156  1.00000000 -0.19305921  0.16948333 -0.01025698
V5   0.18890578  0.1609526  0.07276013 -0.19305921  1.00000000  0.07339531 -0.31837954
V6   0.14466161  0.1053847 -0.35725182  0.16948333  0.07339531  1.00000000  0.02514081
V7   0.05325308 -0.1500816 -0.05859479 -0.01025698 -0.31837954  0.02514081  1.00000000
V8   0.44705527  0.1698571  0.39970105 -0.42461411  0.63951574  0.23065830 -0.28967977
V9   0.21006372 -0.4418132 -0.18623823 -0.25272860  0.15921890  0.36182579 -0.18437981
V10  0.02326108  0.4618036 -0.25205899 -0.05117037  0.02408278  0.47630138 -0.38592733
              V8           V9         V10
V1   0.447055266  0.210063724  0.02326108
V2   0.169857120 -0.441813231  0.46180357
V3   0.399701054 -0.186238233 -0.25205899
V4  -0.424614107 -0.252728595 -0.05117037
V5   0.639515737  0.159218895  0.02408278
V6   0.230658298  0.361825786  0.47630138
V7  -0.289679766 -0.184379813 -0.38592733
V8   1.000000000  0.001023392  0.11436143
V9   0.001023392  1.000000000  0.15301699
V10  0.114361431  0.153016985  1.00000000
Greg
if you really only want to do the correlation on columns 1, 3, and 10, you could always do `cor(x[c(1, 3, 10)])`
Greg
Sorry, this is for numeric, not non-numeric data. I'll leave it just in case.
Greg
glad you left it, Greg. You already helped someone – it already helped me to se sapply in another creative way :)
ran2
+3  A: 

For numerical data you have the solution. But it is categorical data, you said. Then life gets a bit more complicated...

Well, first : The amount of association between two categorical variables is not measured with a Spearman rank correlation, but with a Chi-square test for example. Which is logic actually. Ranking means there is some order in your data. Now tell me which is larger, yellow or red? I know, sometimes R does perform a spearman rank correlation on categorical data. If I code yellow 1 and red 2, R would consider red larger than yellow.

So, forget about Spearman for categorical data. I'll demonstrate the chisq-test and how to choose columns using combn(). But you would benefit from a bit more time with Agresti's book : http://www.amazon.com/Categorical-Analysis-Wiley-Probability-Statistics/dp/0471360937

set.seed(1234)
X <- rep(c("A","B"),20)
Y <- sample(c("C","D"),40,replace=T)

table(X,Y)
chisq.test(table(X,Y),correct=F)
# I don't use Yates continuity correction

#Let's make a matrix with tons of columns

Data <- as.data.frame(
          matrix(
            sample(letters[1:3],2000,replace=T),
            ncol=25
          )
        )

# You want to select which columns to use
columns <- c(3,7,11,24)
vars <- names(Data)[columns]

# say you need to know which ones are associated with each other.
out <-  apply( combn(columns,2),2,function(x){
          chisq.test(table(Data[,x[1]],Data[,x[2]]),correct=F)$p.value
        })

out <- cbind(as.data.frame(t(combn(vars,2))),out)

Then you should get :

> out
   V1  V2       out
1  V3  V7 0.8116733
2  V3 V11 0.1096903
3  V3 V24 0.1653670
4  V7 V11 0.3629871
5  V7 V24 0.4947797
6 V11 V24 0.7259321

Where V1 and V2 indicate between which variables it goes, and "out" gives the p-value for association. Here all variables are independent. Which you would expect, as I created the data at random.

Joris Meys
Sorry, I have the tendency to nest functions quite often to avoid too many idle variables in my workspace. If you can't make sense of the code, just ask and I'll explain what it does.
Joris Meys
@Joris: thanks. I actually forgot to mention in the question that the data is categorical but ranked (the level of approval with something). You get a vote nonetheless for the code (from which I'll learn stuff anyway) and for the book reference.
wishihadabettername
ah, OK. That explains :-) Sorry for the lecture then, no harm meant. I can definitely recommend Agresti anyway. It's the standard when it comes down to categorical data analysis.
Joris Meys
A: 

I found an easier way by looking at the R script generated by Rattle. It looks like below:

correlations <- cor(mydata[,c(1,3,5:87,89:90,94:98)], use="pairwise", method="spearman")
wishihadabettername
This is almost exactly what [Greg wrote in a comment for his answer](http://stackoverflow.com/questions/3571909/r-how-to-calculate-correlation-cor-for-only-a-subset-of-columns/3572115#3572115).
Marek
Ah, OK, I got sidetracked by the use of sapply().
wishihadabettername