views:

247

answers:

2

i'm trying to do factor analysis for a co-occurrence matrix(C) , which is computed from the term-document matrix(TD) as follows: C=TD*TD'

In theory C should be positive semi-definite , but it isn't and the factor analysis algorithm can't work with it because of this.I can't change the algo because of speed reasons).

I look it up and it might be a numerical stability issue : http://stackoverflow.com/questions/619335/a-simple-algorithm-for-generating-positive-semidefinite-matrices - answer 2.

What's a good way to proceed here ?

+7  A: 

I would do an eigendecomposition of the matrix:

C=Q D Q^-1

If your matrix really is positive semidefinite, then all of the eigenvalues (the entries on the diagonal of D) should be non-negative. (This is probably the test that your factor analysis algorithm is doing as well to see if the matrix is positive semidefinite.)

If you're suffering numeric problems, some of the eigenvalues will probably be barely smaller than zero. Try setting these entries to zero, compute Q D Q^-1 to get a new, corrected C, then submit that to your factor analysis algorithm.

On the other hand, if you find that your matrix C has truly negative eigenvalues, then you know that you're going wrong somewhere in the construction of C.

Martin B
thanks. how small the eigenvalues need to be to be considered a numerical error ? i get negative values between -9e-15 to -1e-32.
yaniv
That's definitely within the range of numerical error. Do you have access to the source code of your factor analysis routine? It's likely it's doing an eigendecomposition itself -- in which case, you could simply modify the code to allow slighly negative eigenvalues. Otherwise, use the method I suggested above -- but beware that after multiplying `Q D Q^-1` together again, you may get new numerical errors that take your eigenvalues slightly below zero again. One thing you could try in that case is to make the eigenvalues very slightly positive instead of exactly zero.
Martin B
I am a little late to the ball here, but I thought I would post a comment. You need to be careful computing the eigen-decomposition of C, as when forming C you are doublingthe condition number of of TD, which could lead to numerical instabilities. With that in mind, I would look at the condition number of TD. It may be that when forming these matrices, they are very badly conditions, leading to poor results when forming and using C.
SplittingField
A: 

Not being able to comment, I will have to echo SplittingField's comment, with the nitpick that forming C=TD*TD' squares the condition number of TD, not double it. Equivalent and much more stable to finding the eigendecomposition of C would be to perform singular value decomposition (SVD) on TD. You get the condition number as a bonus: the ratio of the largest singular value to the smallest singular value is the condition number of the matrix, and the base-10 logarithm of that is an estimate on how much decimal digits you stand to lose had you gone through with using C in your calculations (of course, if the tiniest singular value is 0, your problem is singular!)

J. M.