First, you have to be aware that this approach will introduce three levels of latency for any communication between nodes:
- GPU memory on machine 1 to main memory on machine 1
- Main memory on machine 1 to main memory on machine 2
- Main memory on machine 2 to GPU memory on machine 2
A good first step is to do some back-of-the-envelope calculations to determine whether the speed-up you gain by splitting the problem between multiple machines will outweigh the latency you introduce.
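As a rough sketch of such a calculation (the bandwidth, latency, and problem-size figures below are placeholder assumptions; substitute numbers measured on your own hardware and network, and take the compute time saved from your existing single-GPU timings):

```c
#include <stdio.h>

int main(void)
{
    /* Placeholder assumptions -- measure these on your own setup. */
    const double pcie_bw = 6e9;    /* bytes/s, GPU <-> host memory        */
    const double net_bw  = 1.25e8; /* bytes/s, roughly gigabit Ethernet   */
    const double net_lat = 1e-4;   /* s, per-message network latency      */
    const double bytes   = 64e6;   /* data shipped per exchange (assumed) */

    /* The three hops listed above:
       GPU1 -> host1, host1 -> host2, host2 -> GPU2. */
    double t_comm = bytes / pcie_bw            /* device to host, machine 1 */
                  + bytes / net_bw + net_lat   /* host 1 to host 2          */
                  + bytes / pcie_bw;           /* host to device, machine 2 */

    /* Compute time saved per exchange by sharing the work with another
       node; again a placeholder taken from your own GPU timings. */
    double t_saved = 0.5;

    printf("communication cost: %.4f s, compute saved: %.4f s\n",
           t_comm, t_saved);
    puts(t_saved > t_comm ? "worth splitting" : "not worth splitting");
    return 0;
}
```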
Once you're sure this is the approach you want to follow, it's pretty much up to you to implement it correctly. Note that, at the moment, NVIDIA's CUDA or OpenCL libraries are the better choice because they allow you to use the GPU for computation without it being coupled to an X session. Once ATI's OpenCL implementation supports the GPU, that should become a viable option as well.
Since you already have a working GPU implementation, here are the basic steps you'll need to follow (a rough sketch of how they fit together appears after the list):
- Determine how to adapt your factorization algorithm so the work can be split across separate nodes
- Set up the data exchange between N computers (I notice you have opted for MPI for this)
- Set up the scatter operation that will divide the input problem amongst the computational nodes
- Set up the data exchange between a machine and its GPU
- Set up the gather operation that will collect the results from the nodes back onto a single node
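To give an idea of how those steps fit together, here is a minimal sketch of the pipeline. It assumes one GPU per MPI rank, a problem size that divides evenly between the ranks, and that your existing GPU factorization code is dropped in where the comment indicates; error checking is omitted.

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int total_n = 1 << 20;           /* global problem size (assumed) */
    const int local_n = total_n / nprocs;  /* chunk handled by this rank    */

    float *full_input = NULL, *full_output = NULL;
    if (rank == 0) {                       /* root holds the whole problem  */
        full_input  = (float *)malloc(total_n * sizeof(float));
        full_output = (float *)malloc(total_n * sizeof(float));
        /* ... fill full_input with the data to factorize ...               */
    }

    /* Scatter: divide the input problem amongst the computational nodes. */
    float *local_in  = (float *)malloc(local_n * sizeof(float));
    float *local_out = (float *)malloc(local_n * sizeof(float));
    MPI_Scatter(full_input, local_n, MPI_FLOAT,
                local_in,   local_n, MPI_FLOAT, 0, MPI_COMM_WORLD);

    /* Data exchange between this machine and its GPU. */
    float *d_buf;
    cudaMalloc((void **)&d_buf, local_n * sizeof(float));
    cudaMemcpy(d_buf, local_in, local_n * sizeof(float),
               cudaMemcpyHostToDevice);

    /* ... run your existing GPU factorization code on d_buf here ...       */

    cudaMemcpy(local_out, d_buf, local_n * sizeof(float),
               cudaMemcpyDeviceToHost);

    /* Gather: collect every node's results back on the root node. */
    MPI_Gather(local_out,  local_n, MPI_FLOAT,
               full_output, local_n, MPI_FLOAT, 0, MPI_COMM_WORLD);

    cudaFree(d_buf);
    free(local_in);
    free(local_out);
    if (rank == 0) { free(full_input); free(full_output); }
    MPI_Finalize();
    return 0;
}
```

Compile with mpicc (or mpicxx) and link against the CUDA runtime; how you adapt the factorization itself so that each rank can work on its chunk independently is the part that depends entirely on your algorithm.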