views:

118

answers:

1

I would like to compute the Moore–Penrose pseudoinverse of an enormous matrix. Ideally, I would like to do it on a matrix that has 23 million rows and 1000 columns, but if necessary I can reduce the number of rows to 4 million by only running on one part of my experiment.

Obviously, loading the matrix in to memory and running SVD on it is not going to work. Wikipedia points to Krylov subspace methods and mention the Arnoldi, Lanczos, Conjugate gradient, GMRES (generalized minimum residual), BiCGSTAB (biconjugate gradient stabilized), QMR (quasi minimal residual), TFQMR (transpose-free QMR), and MINRES (minimal residual) methods as being among the best Krylov subspace methods. But I don't know where to go from here. Is computing the pseudoinverse of such a huge matrix even feasible? If so, using which algorithms or software libraries? I have a large computing cluster available, so parallel approaches are welcome.

This answer points to the R package biglm. Would that work? Has anyone used it? I normally work in Python, but don't mind using other languages and tools for this particular task.

+1  A: 

You might be better off using a block iterative algorithm that converges directly to the least squares solution than computing the least squares solution through the pseudoinverse. See "Applied Iterative Methods" by Charlie Byrne. These algorithms are closely related to the Krylov subspace methods, but are tuned for easy computation. You can get an introduction by looking at chapter 3 of this preprint of another of his books.

deinst