I need to compute the null space of several thousand small matrices (8x9, not 4x3 as I wrote previously) in parallel (CUDA). All references point to SVD, but the algorithm in Numerical Recipes seems very expensive and gives me lots of things other than the null space that I don't really need. Is Gaussian elimination really not an option? Are there any other commonly used methods?

A: 

"seems very expensive" - what data do you have that supports this?

Maybe Block Lanczos is the answer you seek.

Or maybe this.

Both JAMA and Apache Commons Math have SVD implementations in Java. Why not take those and try them out? Get some real data for your case instead of impressions. It won't cost you much, since the code is already written and tested.

duffymo
It seems expensive in that its complexity is much higher than that of a row-reduction method, for example. I am not an expert, but I can calculate the null space using Gaussian elimination by hand, and as far as I am aware its unsuitability is due solely to rounding errors.
zenna
"seems" is the operative word here. This article suggests that it's O(m^3): http://www.cs.rpi.edu/~drinep/Papers/Drineas_PCI_01.pdf.
duffymo
I give this post a -1 for incorrect information. Gaussian Elimination can be used for solving Ax = 0. Isn't that just the null space?
Moron
Fair enough, you're correct.
duffymo
+3  A: 

Gaussian elimination is plenty fast for 4x3 matrices. IIRC I've done about 5 million per second with Java without parallelism. With such a small problem, your best bet is to code the routine (row reduce etc.) yourself; otherwise you'll waste most of the time putting the data into the right format for the external routine.
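
For instance, a minimal row-reduction sketch in plain C (not CUDA); the 8x9 size, the EPS tolerance, and the assumption that the rank is exactly 8 are placeholders to tune for your own data:

    #include <math.h>

    #define M 8          /* rows (the 8x9 case from the question)   */
    #define N 9          /* columns                                  */
    #define EPS 1e-10    /* "close enough to zero" pivot tolerance   */

    /* Row-reduce A in place, then back-substitute one null-space vector
       into x. Assumes rank(A) == M, so the null space is 1-dimensional.
       Returns 0 on success, -1 if a pivot falls below EPS (i.e. the
       matrix is effectively rank-deficient and needs other handling). */
    static int null_vector(double A[M][N], double x[N])
    {
        for (int k = 0; k < M; ++k) {
            int piv = k;                       /* partial pivoting */
            for (int i = k + 1; i < M; ++i)
                if (fabs(A[i][k]) > fabs(A[piv][k])) piv = i;
            if (fabs(A[piv][k]) < EPS) return -1;
            for (int j = 0; j < N; ++j) {      /* swap rows k and piv */
                double t = A[k][j]; A[k][j] = A[piv][j]; A[piv][j] = t;
            }
            for (int i = k + 1; i < M; ++i) {  /* eliminate below the pivot */
                double f = A[i][k] / A[k][k];
                for (int j = k; j < N; ++j) A[i][j] -= f * A[k][j];
            }
        }
        x[N - 1] = 1.0;                        /* free variable */
        for (int i = M - 1; i >= 0; --i) {     /* back-substitution */
            double s = 0.0;
            for (int j = i + 1; j < N; ++j) s += A[i][j] * x[j];
            x[i] = -s / A[i][i];
        }
        return 0;
    }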

Rex Kerr
By 'not an option' I was referring to whether or not errors due to rounding are a problem. Or does it depend heavily on the application?
zenna
It depends heavily on the application. If it is okay to treat nearly null and actually null both as null, you set some epsilon that is considered "close enough to zero" and use Gaussian elimination. If it matters very much to you when things are close to singular but not quite, your numerical accuracy will be a lot better with SVD (typically). The larger your matrix, the worse it gets (typically), so now that you say 8x9, I'd more seriously consider SVD. Why not try out both methods with non-CUDA code and see if SVD is required?
Rex Kerr
+1  A: 

I think the most important thing for CUDA is to find an algorithm that doesn't depend on conditional branching (which is quite slow on graphics hardware). Simple if statements that can be optimized into conditional assignment are much better (or you can use the ?: operator).

If necessary, you should be able to do some form of pivoting using conditional assignment. It might actually be harder to determine how to store your result: if your matrix is rank-deficient, what do you want your CUDA program to do about it?
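
As a rough illustration of the conditional-assignment idea (plain C here, and the 4x3 size is only a placeholder), pivot selection can be written so that the comparison feeds a ?: select rather than a branch:

    #include <math.h>

    /* Pick the pivot row for column k of a 4x3 matrix without an if/else
       branch: the comparison result drives conditional assignments (?:),
       which the compiler can turn into predicated/select instructions. */
    static int pivot_row(const float a[4][3], int k)
    {
        int   best = k;
        float mag  = fabsf(a[k][k]);
        for (int i = k + 1; i < 4; ++i) {      /* fixed trip count, unrollable */
            float m = fabsf(a[i][k]);
            best = (m > mag) ? i : best;       /* conditional assignment */
            mag  = (m > mag) ? m : mag;
        }
        return best;
    }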

If you assume your 4x3 matrix is not actually rank-deficient, you can find your (single) null-space vector without any conditionals at all: the matrix is small enough that you can use Cramer's rule efficiently.

Since you don't actually care about the scale of your null vector, you don't have to divide by the determinant -- you can just take the determinants of the minors:

    x1 x2 x3
M = y1 y2 y3
    z1 z2 z3
    w1 w2 w3

         |y1 y2 y3|        |x1 x2 x3|       |x1 x2 x3|        |x1 x2 x3|
->  x0 = |z1 z2 z3|  y0 = -|z1 z2 z3|  z0 = |y1 y2 y3|  w0 = -|y1 y2 y3|
         |w1 w2 w3|        |w1 w2 w3|       |w1 w2 w3|        |z1 z2 z3|

Note that these 3x3 determinants are just triple products; you can save computation by reusing the cross products.
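
For example (a plain C sketch with the rows of M stored as 3-vectors; the vec3 type and the helper names are just for illustration):

    /* Null vector (x0, y0, z0, w0) of M = [x; y; z; w] via the signed
       minors above. Each 3x3 determinant is a scalar triple product
       a . (b x c), so the cross product z x w can be shared. */
    typedef struct { double a, b, c; } vec3;

    static vec3 cross(vec3 u, vec3 v) {
        vec3 r = { u.b*v.c - u.c*v.b, u.c*v.a - u.a*v.c, u.a*v.b - u.b*v.a };
        return r;
    }
    static double dot(vec3 u, vec3 v) { return u.a*v.a + u.b*v.b + u.c*v.c; }

    static void null_vec_4x3(vec3 x, vec3 y, vec3 z, vec3 w, double out[4])
    {
        vec3 zw = cross(z, w);               /* reused by the x0 and y0 minors */
        out[0] =  dot(y, zw);                /* x0 =  |y z w| */
        out[1] = -dot(x, zw);                /* y0 = -|x z w| */
        out[2] =  dot(x, cross(y, w));       /* z0 =  |x y w| */
        out[3] = -dot(x, cross(y, z));       /* w0 = -|x y z| */
    }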

comingstorm
A: 

Apparently it is possible to do Gaussian Elimination (and algorithms based on that) using CUDA constructs.

Here is one paper: http://academic.research.microsoft.com/Paper/4908208.aspx

Hope that helps.

Moron
+1  A: 

I wondered if the matrices are related rather than just being random, so that the null spaces you are seeking can be thought of as 1-dimensional tangents to a curve in N-space (N = 9). If so, you may be able to speed things up by using Newton's method to solve successive instances of the system of quadratic equations Ax = 0, |x|^2 = 1, starting from a previous null space vector. Newton's method uses first derivatives to converge to a solution, and so would use Gaussian elimination to solve 9x9 systems. Using this technique would require that you be able to make small steps from matrix to matrix by, say, varying a parameter.

So the idea is that you initialize using SVD on the first matrix, but thereafter you step from matrix to matrix, using the null space vector of one as the starting point for the iteration on the next. You need one or two iterations to get convergence; if you don't get convergence, you use SVD to restart. If this is the situation you have, it is much faster than starting fresh on each matrix.
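
A rough sketch of one such Newton step (plain C, my paraphrase rather than any original code; A is the 8x9 matrix, x is the previous null space vector, and the 9x9 solve is ordinary Gaussian elimination with partial pivoting):

    #include <math.h>

    #define M 8   /* rows of A                */
    #define N 9   /* columns of A = unknowns  */

    /* Solve the N x N system J d = r by Gaussian elimination with partial
       pivoting (J and r are overwritten). Returns -1 on a tiny pivot. */
    static int solve(double J[N][N], double r[N], double d[N])
    {
        for (int k = 0; k < N; ++k) {
            int p = k;
            for (int i = k + 1; i < N; ++i)
                if (fabs(J[i][k]) > fabs(J[p][k])) p = i;
            if (fabs(J[p][k]) < 1e-14) return -1;
            for (int j = 0; j < N; ++j) { double t = J[k][j]; J[k][j] = J[p][j]; J[p][j] = t; }
            double t = r[k]; r[k] = r[p]; r[p] = t;
            for (int i = k + 1; i < N; ++i) {
                double f = J[i][k] / J[k][k];
                for (int j = k; j < N; ++j) J[i][j] -= f * J[k][j];
                r[i] -= f * r[k];
            }
        }
        for (int i = N - 1; i >= 0; --i) {
            double s = r[i];
            for (int j = i + 1; j < N; ++j) s -= J[i][j] * d[j];
            d[i] = s / J[i][i];
        }
        return 0;
    }

    /* One Newton step for F(x) = [A x ; (x.x - 1)/2] = 0, starting from the
       null space vector of the previous (nearby) matrix. The Jacobian is
       [A ; x^T]; if the solve fails, fall back to SVD and restart. */
    static int newton_step(const double A[M][N], double x[N])
    {
        double J[N][N], rhs[N], d[N];
        for (int i = 0; i < M; ++i) {
            rhs[i] = 0.0;
            for (int j = 0; j < N; ++j) { J[i][j] = A[i][j]; rhs[i] -= A[i][j] * x[j]; }
        }
        rhs[M] = 0.5;                          /* rhs = -F(x), last component */
        for (int j = 0; j < N; ++j) { J[M][j] = x[j]; rhs[M] -= 0.5 * x[j] * x[j]; }
        if (solve(J, rhs, d) != 0) return -1;
        for (int j = 0; j < N; ++j) x[j] += d[j];
        return 0;
    }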

I used this a long time ago to map contours in the solutions of sets of 50 x 50 quadratic equations associated with the behavior of electric power systems.

Permaquid
Interesting. In this case the matrices are not related, but this method will surely be useful elsewhere; after reading your response I have seen a lot more literature on this kind of updating method.
zenna
Oh yes, I used QR transformations, not SVD. There are lots of efficiencies to be had there. It's not so easy to remember details after more than 25 years though.
Permaquid
A: 

To answer your question directly... yes! QR decomposition!

Let A be an m-by-n matrix with rank n. QR decomposition finds an orthogonal m-by-m matrix Q and an upper triangular m-by-n matrix R such that A = QR. If we define Q = [Q1 Q2], where Q1 is m-by-n and Q2 is m-by-(m-n), then the columns of Q2 form an orthonormal basis for the null space of A^T.

QR decomposition is computed either by Gram-Schmidt, Givens rotations, or Householder reflections. They have different stability properties and operation counts.
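
For a rough idea of the Givens variant (a plain C sketch, not tuned for CUDA; the sizes are placeholders, and for your 8x9 matrices A would be the 9x8 transpose, so that the last column of Q spans the null space you want):

    #include <math.h>

    #define M 9   /* rows of A (A = transpose of one 8x9 matrix)     */
    #define N 8   /* columns of A; the rank is assumed to equal N    */

    /* QR by Givens rotations: on return A holds R (upper triangular in its
       first N rows) and Q holds the orthogonal factor, so original A = Q R.
       The last M-N columns of Q then span the null space of A^T -- for the
       8x9 case that is the single 9-vector you are after. */
    static void givens_qr(double A[M][N], double Q[M][M])
    {
        for (int i = 0; i < M; ++i)
            for (int j = 0; j < M; ++j)
                Q[i][j] = (i == j) ? 1.0 : 0.0;

        for (int j = 0; j < N; ++j) {
            for (int i = M - 1; i > j; --i) {          /* zero A[i][j] */
                double a = A[i - 1][j], b = A[i][j];
                double r = hypot(a, b);
                if (r == 0.0) continue;                /* already zero */
                double c = a / r, s = b / r;
                for (int k = j; k < N; ++k) {          /* rotate rows i-1, i of A */
                    double t1 = A[i - 1][k], t2 = A[i][k];
                    A[i - 1][k] =  c * t1 + s * t2;
                    A[i][k]     = -s * t1 + c * t2;
                }
                for (int k = 0; k < M; ++k) {          /* accumulate Q <- Q * G^T */
                    double q1 = Q[k][i - 1], q2 = Q[k][i];
                    Q[k][i - 1] =  c * q1 + s * q2;
                    Q[k][i]     = -s * q1 + c * q2;
                }
            }
        }
        /* Null-space basis of the original 8x9 matrix: columns N..M-1 of Q. */
    }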

You are right: SVD is expensive! I can't speak for what state-of-the-art stuff uses, but when I hear "compute null space" (EDIT: in a way that is simple for me to understand), I think QR.

Steve
Thanks. Coincidentally, after a lot of reading (much of which was about old-school parallel implementations on vector machines in the 1980s), I had decided to attempt the QR decomposition with Givens rotations; I will update this if it works out well.
zenna
Great! Glad to hear. Although you are using CUDA, I have Matlab code if you need any assistance.
Steve
I implemented a QR decomposition using Givens rotations and parallelised at 3 levels: 1. the matrix multiplication between a row pair of matrix A and the 2x2 Givens matrix, using 16 threads for each element of the multiplication product; 2. doing 4 row pairs in parallel, 2 for Q and 2 for A; and 3. doing the 25000 matrices in parallel. Overall I halved the run time, from 20 ms with the SVD to 9.5 ms. So success!
zenna
Wow, that's awesome! Yes, these rotations should definitely be parallelizable. Now I want to try this myself. :-)
Steve
Yeah! Get in touch if you need any assistance; it was quite a complex job. I might write a blog post on it or something.
zenna
This link looks really useful for this exact problem: http://www.cse.illinois.edu/courses/cs554/notes/11_qr.pdf
Steve
... and this one: http://www.cs.umd.edu/~oleary/reprints/j32.pdf (I took numerical optimization with Prof. O'Leary.)
Steve
Yeah, I have read them both; that first one set me on a paper trail. Deciding which order to do Givens rotations in is not trivial, and there are quite a few papers just on that topic. I have collected some of the papers in a Mendeley collection: http://www.mendeley.com/collections/1105691/Matrix-Parallel/
zenna