ansaurus

Question

Screening (multi)collinearity in a regression model

Answer 1

+15 A:

The kappa() function can help. Here is a simulated example:

> set.seed(42)
> x1 <- rnorm(100)
> x2 <- rnorm(100)
> x3 <- x1 + 2*x2 + rnorm(100)*0.0001    # so x3 approx a linear comb. of x1+x2
> mm12 <- model.matrix(~ x1 + x2)        # normal model, two indep. regressors
> mm123 <- model.matrix(~ x1 + x2 + x3)  # bad model with near collinearity
> kappa(mm12)                            # a 'low' kappa is good
[1] 1.166029
> kappa(mm123)                           # a 'high' kappa indicates trouble
[1] 121530.7

We can go further by making the third regressor more and more collinear:

> x4 <- x1 + 2*x2 + rnorm(100)*0.000001  # even more collinear
> mm124 <- model.matrix(~ x1 + x2 + x4)
> kappa(mm124)
[1] 13955982
> x5 <- x1 + 2*x2                        # now x5 is linear comb of x1,x2
> mm125 <- model.matrix(~ x1 + x2 + x5)
> kappa(mm125)
[1] 1.067568e+16

These values are estimates: by default kappa() approximates the condition number rather than computing it exactly. See help(kappa) for details.
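As a minimal sketch of the difference between the default estimate and the exact value, assuming the same simulated data as above: with exact = TRUE, kappa() returns the 2-norm condition number, which is the ratio of the largest to the smallest singular value of the model matrix. The names k_est, k_exact and k_svd are only illustrative.

```r
# Rebuild the near-collinear model matrix from the example above
set.seed(42)
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- x1 + 2 * x2 + rnorm(100) * 0.0001
mm123 <- model.matrix(~ x1 + x2 + x3)

# Default: a cheap estimate of the condition number
k_est <- kappa(mm123)

# exact = TRUE: the 2-norm condition number, computed from the SVD
k_exact <- kappa(mm123, exact = TRUE)

# The same quantity by hand: largest over smallest singular value
sv <- svd(mm123)$d
k_svd <- max(sv) / min(sv)

print(c(estimate = k_est, exact = k_exact, by_svd = k_svd))
```

The estimate and the exact value need not agree closely, but both should be of a size that flags the near collinearity.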

Dirk Eddelbuettel 2010-06-15 02:58:13

Sublime... thanks for this one!

aL3xa 2010-06-15 11:30:56

Answer 2

+4 A:

See also Section 9.4 of the book Practical Regression and Anova using R [Faraway 2002].

rcs 2010-06-15 07:50:52

Answer 3

+10 A:

Just to add to what Dirk said about the Condition Number method, a rule of thumb is that values of CN > 30 indicate severe collinearity. Other methods, apart from condition number, include:

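1) the determinant of the correlation matrix, which ranges from 0 (perfect collinearity) to 1 (no collinearity). The determinant of the covariance matrix, used in the example here, also approaches 0 under collinearity but is not bounded above by 1.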

# using Dirk's example
> det(cov(mm12[,-1]))
[1] 0.8856818
> det(cov(mm123[,-1]))
[1] 8.916092e-09

2) Using the fact that the determinant of a matrix is the product of its eigenvalues: the presence of one or more eigenvalues close to zero indicates collinearity

> eigen(cov(mm12[,-1]))$values
[1] 1.0876357 0.8143184

> eigen(cov(mm123[,-1]))$values
[1] 5.388022e+00 9.862794e-01 1.677819e-09

3) The value of the Variance Inflation Factor (VIF). The VIF for predictor i is 1/(1-R_i^2), where R_i^2 is the R^2 from a regression of predictor i on the remaining predictors. Collinearity is present when the VIF for at least one independent variable is large. Rule of thumb: VIF > 10 is of concern. For an implementation in R see here.

I would also like to comment that the use of R^2 for determining collinearity should go hand in hand with visual examination of the scatterplots, because a single outlier can "cause" collinearity where it doesn't exist, or can hide collinearity where it exists.
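The three diagnostics listed above can be sketched together on the correlation scale, using base R only. This is an illustration under stated assumptions, not the implementation linked above: the data are simulated with a milder collinearity (noise sd 0.5, my choice) so the matrices stay well conditioned, and vif_manual is a hypothetical helper written directly from the 1/(1-R_i^2) definition.

```r
# Simulated regressors; x3 is close to, but not exactly, x1 + 2*x2
set.seed(42)
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- x1 + 2 * x2 + rnorm(100) * 0.5
X <- data.frame(x1, x2, x3)

# 1) Determinant of the correlation matrix: near 0 means collinearity
R <- cor(X)
det_R <- det(R)

# 2) Eigenvalues of the correlation matrix: a value near 0 flags collinearity
ev <- eigen(R, symmetric = TRUE)$values

# 3) VIF from the definition: regress each predictor on the others
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    f <- reformulate(setdiff(names(df), v), response = v)
    1 / (1 - summary(lm(f, data = df))$r.squared)
  })
}
vifs <- vif_manual(X)

# Cross-check: the VIFs are the diagonal of the inverse correlation matrix
vif_inv <- diag(solve(R))

print(det_R)
print(ev)
print(rbind(manual = vifs, inverse_cor = vif_inv))
```

The cross-check via solve(cor(X)) is a quick way to get all VIFs at once, but it fails when the collinearity is exact, because the correlation matrix is then singular.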

gd047 2010-06-15 08:23:07

Thanks Γιώργος, +2 for this one! Great answer!

aL3xa 2010-06-15 16:50:13

Answer 4

+4 A:

You might like Vito Ricci's Reference Card "R Functions For Regression Analysis" http://cran.r-project.org/doc/contrib/Ricci-refcard-regression.pdf

It succinctly lists many useful regression-related functions in R, including diagnostic functions. In particular, it lists the vif function from the car package, which can assess multicollinearity. http://en.wikipedia.org/wiki/Variance_inflation_factor

Consideration of multicollinearity often goes hand in hand with issues of assessing variable importance. If this applies to you, perhaps check out the relaimpo package: http://prof.beuth-hochschule.de/groemping/relaimpo/

Jeromy Anglim 2010-06-15 09:08:29

Technically, and arithmetically, VIF = 1/(1 - R^2), where R^2 refers to the example I stated in my question. I forgot to mention VIF, so thanks for helping on this one! `relaimpo` is a great find!

aL3xa 2010-06-15 11:41:07
