I am trying to use GLSMultipleLinearRegression (from the Apache commons-math package) for multiple linear regression. It expects a covariance matrix as input, and I am not sure how to compute it. I have one array for the dependent variable and 3 arrays of independent variables.
Any idea how to compute the covariance matrix?

Note: I have 200 items for each of the 3 independent variables

Thanks
Bharani

+1  A: 

Have you tried creating a Covariance matrix directly from your data?

new Covariance(data).getCovarianceMatrix()

Using the information in the comment, we know that there are 3 independent, 1 dependent variables and 200 samples. That implies that you will have a data array with 4 columns and 200 rows. The end result will look something like this (typing everything out explicitly in order to try to explain what I mean):

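import org.apache.commons.math.stat.correlation.Covariance;

// One row per sample (200 rows), one column per variable (y, x1, x2, x3)
double[][] data = new double[200][];
data[0] = new double[]{y[0], x[0][0], x[1][0], x[2][0]};
data[1] = new double[]{y[1], x[0][1], x[1][1], x[2][1]};
data[2] = new double[]{y[2], x[0][2], x[1][2], x[2][2]};
// ... etc.
data[199] = new double[]{y[199], x[0][199], x[1][199], x[2][199]};
// The constructor computes the covariance between the columns of data
Covariance covariance = new Covariance(data);
// omega is 4 x 4 here: one row and one column per variable
double[][] omega = covariance.getCovarianceMatrix().getData();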

Then, when you're doing your actual regression, you have your covariance matrix:

MultipleLinearRegression regression = new GLSMultipleLinearRegression();
// Assumes you put your independent variables in x and dependent in y
// Also assumes that you made your covariance matrix as shown above 
regression.addData(y, x, omega); // we do need covariance
Bob Cross
Yes, I tried doing that. My problem is that I have 200 items for each X, so it is not a square matrix, and GLS complains: org.apache.commons.math.MathRuntimeException$4: dimension mismatch 200 != 3. Sorry, I should have mentioned that in the problem statement; I will edit it now.
Bharani
@Bharani, updated the answer to try to address your comment.
Bob Cross
+1  A: 

If you have no idea of the covariance between the errors, I would use Ordinary Least Squares (OLS) instead of Generalized Least Squares (GLS). This amounts to taking the identity matrix as the covariance matrix. The library appears to implement OLS in OLSMultipleLinearRegression.
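To make the "identity matrix as covariance matrix" remark concrete, here is a minimal plain-Java sketch that builds such a matrix. The class and method names are hypothetical and not part of commons-math. Note that the matrix has one row and one column per observation (200 x 200 for the data in the question), not one per variable.

```java
class IdentityCovariance {
    /** Builds an n x n identity matrix, where n is the number of observations. */
    static double[][] identity(int n) {
        double[][] omega = new double[n][n]; // Java zero-initializes the array
        for (int i = 0; i < n; i++) {
            omega[i][i] = 1.0; // unit variance, no correlation between errors
        }
        return omega;
    }

    public static void main(String[] args) {
        double[][] omega = identity(200);
        System.out.println(omega.length + " x " + omega[0].length);
    }
}
```

Passing a matrix like this to a GLS routine should give the same coefficients as OLS, since the weighting then has no effect.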

Jitse Niesen
I tried OLS at first, but it looks like GLS is what we need, so I should find some way to estimate the covariance.
Bharani
+4  A: 

If you do not know the covariance between the errors, you can take an iterative approach. First fit with Ordinary Least Squares, then calculate the errors (residuals) and the covariances between them. Next, apply GLS using the calculated covariance matrix and re-estimate the covariance matrix from the new residuals. Continue iterating, using GLS with each new covariance matrix, until the estimates converge. Here is a link (.pdf warning) to an example of this method, as well as a related discussion of Weighted and Iteratively Weighted Least Squares, which apply when the errors are uncorrelated rather than correlated as GLS allows.
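One step of that loop might look like the sketch below. It rests on a strong assumption that is not in the original answer: the errors follow an AR(1) pattern, so a single correlation parameter rho describes the whole n x n matrix. Some structure of this kind is needed, because a single residual vector cannot determine a full unrestricted covariance matrix. The class and method names are hypothetical; the residuals would come from the OLS (or previous GLS) fit.

```java
class Ar1Covariance {
    /**
     * Lag-1 autocorrelation of the residuals: sum(e[t] * e[t-1]) / sum(e[t]^2).
     * Assumes the residuals are not all zero.
     */
    static double estimateRho(double[] residuals) {
        double num = 0.0;
        double den = 0.0;
        for (int t = 0; t < residuals.length; t++) {
            if (t > 0) {
                num += residuals[t] * residuals[t - 1];
            }
            den += residuals[t] * residuals[t];
        }
        return num / den;
    }

    /** n x n matrix with entries rho^|i-j| (the assumed AR(1) structure). */
    static double[][] ar1Matrix(int n, double rho) {
        double[][] omega = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                omega[i][j] = Math.pow(rho, Math.abs(i - j));
            }
        }
        return omega;
    }
}
```

The iteration would then refit with GLS using `ar1Matrix(n, rho)`, recompute the residuals, re-estimate rho, and stop once rho changes by less than some tolerance. A different error structure (for example a diagonal matrix of unequal variances) would need a different estimator in place of `estimateRho`.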

Mark Lavin
I see that the example uses R. Though nothing prevents me from doing the same in Java, I guess time is the limiting factor. I was hoping that commons had built-in support for this, but it looks like it doesn't.
Bharani
+1  A: 

Just came across the Flanagan library, which does this out of the box. I also got a mail from the commons user list saying that commons-math does not support FGLS (automatic estimation of the covariance matrix) at the moment.

-Bharani

Bharani
A: 

@Mark Lavin

You would first use Ordinary Least Squares, calculating the errors, and the covariances between the errors

I'm a bit confused. Since we have only one response variable, the residual errors should be a one-dimensional variable. Where does a covariance matrix of errors fit in?

Moving from OLS to GLS, you drop the assumption that the errors are independent and identically normally distributed: e ~ N(0, s^2*I), where I is the identity matrix. You instead assume that there is a covariance matrix C such that e ~ N(0, s^2*C). You then minimize (y-Xb)'*C^(-1)*(y-Xb) as opposed to (y-Xb)'*(y-Xb). Here C is a square matrix whose size equals the number of observations (200 x 200 in this case), with one row and one column per error term. The problem with GLS is that you have to know C already, up to a multiplicative constant.
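For reference, the two minimization problems have the standard closed-form solutions below. The notation is assumed here: n observations and p regressors, so y is n x 1, X is n x p, b is p x 1, and C is n x n.

```latex
\begin{aligned}
\hat{b}_{\mathrm{OLS}} &= \arg\min_b\,(y - Xb)'(y - Xb) = (X'X)^{-1}X'y \\
\hat{b}_{\mathrm{GLS}} &= \arg\min_b\,(y - Xb)'C^{-1}(y - Xb) = (X'C^{-1}X)^{-1}X'C^{-1}y
\end{aligned}
```

Setting C = I reduces the second line to the first, and multiplying C by a positive constant leaves the GLS estimate unchanged, which is why C only needs to be known up to a multiplicative constant.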
Mark Lavin
A: 

You need to organize the 3 independent variables as column vectors x1, x2, x3 in a matrix (N columns), where each row is an observation (M rows). This will be an MxN matrix.

You then plug this data matrix into a covariance routine provided by Apache such as: Covariance.computeCovarianceMatrix(RealMatrix matrix).
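As a rough illustration of what a column-wise covariance produces, here is a plain-Java sketch (hypothetical class and method names, not the commons-math implementation) that computes the sample covariance between the columns of an M x N data matrix. For 200 rows and 3 columns the result is 3 x 3: it describes how the variables co-vary, which is a different object from an M x M covariance of the per-observation errors.

```java
class ColumnCovariance {
    /** Sample covariance (divisor m - 1) between the columns of an m x n matrix. */
    static double[][] covariance(double[][] data) {
        int m = data.length;    // number of observations (rows)
        int n = data[0].length; // number of variables (columns)
        double[] mean = new double[n];
        for (double[] row : data) {
            for (int j = 0; j < n; j++) {
                mean[j] += row[j] / m;
            }
        }
        double[][] cov = new double[n][n];
        for (double[] row : data) {
            for (int j = 0; j < n; j++) {
                for (int k = 0; k < n; k++) {
                    cov[j][k] += (row[j] - mean[j]) * (row[k] - mean[k]) / (m - 1);
                }
            }
        }
        return cov;
    }

    public static void main(String[] args) {
        double[][] data = new double[200][3]; // placeholder values, shape only
        double[][] cov = covariance(data);
        System.out.println(cov.length + " x " + cov[0].length);
    }
}
```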

simon
Yes, I have done that. Apparently the covariance matrix required by GLS is MxM. Try it with a unit test and you will get the error I already mentioned (a dimension mismatch saying 200 != 3).
Bharani
So you want a 3x3 covariance matrix, right? In this case N=3 and M=200, or is it the other way around? C = Covariance.computeCovarianceMatrix(RealMatrix myData). C should be a 3x3 matrix, which you then plug into GLS.
simon