First, you have to be aware that this approach will introduce three levels of latency for any communication between nodes:
- GPU memory on machine 1 to main memory on machine 1
- Main memory on machine 1 to main memory on machine 2
- Main memory on machine 2 to GPU memory on machine 2
A good first step is to do some back-of-the-envelope calculations to determine whether the speed-up you gain by splitting the problem between multiple machines will outweigh the latency you introduce.
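As a rough sketch of such a calculation (the bandwidth, latency, and problem-size figures below are placeholder assumptions; substitute numbers measured on your own hardware and network, and take the compute time saved from your existing single-GPU timings):

```c
#include <stdio.h>

int main(void)
{
    /* Placeholder assumptions -- measure these on your own setup. */
    const double pcie_bw = 6e9;    /* bytes/s, GPU <-> host memory        */
    const double net_bw  = 1.25e8; /* bytes/s, roughly gigabit Ethernet   */
    const double net_lat = 1e-4;   /* s, per-message network latency      */
    const double bytes   = 64e6;   /* data shipped per exchange (assumed) */

    /* The three hops listed above:
       GPU1 -> host1, host1 -> host2, host2 -> GPU2. */
    double t_comm = bytes / pcie_bw            /* device to host, machine 1 */
                  + bytes / net_bw + net_lat   /* host 1 to host 2          */
                  + bytes / pcie_bw;           /* host to device, machine 2 */

    /* Compute time saved per exchange by sharing the work with another
       node; again a placeholder taken from your own GPU timings. */
    double t_saved = 0.5;

    printf("communication cost: %.4f s, compute saved: %.4f s\n",
           t_comm, t_saved);
    puts(t_saved > t_comm ? "worth splitting" : "not worth splitting");
    return 0;
}
```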
Once you're sure this is the approach you want to follow, it's pretty much up to you to implement it correctly. Note that, at the moment, NVIDIA's CUDA or OpenCL libraries are the better choice because they allow you to use the GPU for computation without it being coupled to an X session. Once ATI's OpenCL implementation supports the GPU, that should become a viable option as well.
Since you already have a working GPU implementation, here are the basic steps you'll need to follow (a rough sketch of how they fit together appears after the list):
- Determine how to adapt your factorization algorithm so the work can be split across separate nodes
- Set up the data exchange between N computers (I notice you have opted for MPI for this)
- Set up the scatter operation that will divide the input problem amongst the computational nodes
- Set up the data exchange between a machine and its GPU
- Set up the gather operation that will collect the results from the nodes back onto a single node
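To give an idea of how those steps fit together, here is a minimal sketch of the pipeline. It assumes one GPU per MPI rank, a problem size that divides evenly between the ranks, and that your existing GPU factorization code is dropped in where the comment indicates; error checking is omitted.

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int total_n = 1 << 20;           /* global problem size (assumed) */
    const int local_n = total_n / nprocs;  /* chunk handled by this rank    */

    float *full_input = NULL, *full_output = NULL;
    if (rank == 0) {                       /* root holds the whole problem  */
        full_input  = (float *)malloc(total_n * sizeof(float));
        full_output = (float *)malloc(total_n * sizeof(float));
        /* ... fill full_input with the data to factorize ...               */
    }

    /* Scatter: divide the input problem amongst the computational nodes. */
    float *local_in  = (float *)malloc(local_n * sizeof(float));
    float *local_out = (float *)malloc(local_n * sizeof(float));
    MPI_Scatter(full_input, local_n, MPI_FLOAT,
                local_in,   local_n, MPI_FLOAT, 0, MPI_COMM_WORLD);

    /* Data exchange between this machine and its GPU. */
    float *d_buf;
    cudaMalloc((void **)&d_buf, local_n * sizeof(float));
    cudaMemcpy(d_buf, local_in, local_n * sizeof(float),
               cudaMemcpyHostToDevice);

    /* ... run your existing GPU factorization code on d_buf here ...       */

    cudaMemcpy(local_out, d_buf, local_n * sizeof(float),
               cudaMemcpyDeviceToHost);

    /* Gather: collect every node's results back on the root node. */
    MPI_Gather(local_out,  local_n, MPI_FLOAT,
               full_output, local_n, MPI_FLOAT, 0, MPI_COMM_WORLD);

    cudaFree(d_buf);
    free(local_in);
    free(local_out);
    if (rank == 0) { free(full_input); free(full_output); }
    MPI_Finalize();
    return 0;
}
```

Compile with mpicc (or mpicxx) and link against the CUDA runtime; how you adapt the factorization itself so that each rank can work on its chunk independently is the part that depends entirely on your algorithm.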