views: 202
answers: 6
I have a system of 6 equations that I need to solve over and over again in a program (with many different inputs, of course). I am currently using Cramer's rule to solve the system and it works quite well (my processor seems to really like add and multiply operations; it gets solutions in 1 microsecond despite the explicit equations being over 2 pages in length). However, the number of times I need to solve it is huge, and I'm looking for an even faster method.

The question is: is there an even faster or more efficient method for solving these equations, or would something like CUDA be beneficial here?

+2  A: 

You might check out Boost's uBLAS.

The method is not as straightforward, however; you'll want to look into LU decomposition.
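
For reference, a minimal sketch of what the LU route looks like with Boost uBLAS (assuming boost/numeric/ublas/lu.hpp; just to show the shape of the API, not a tuned implementation):

```cpp
#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/vector.hpp>
#include <boost/numeric/ublas/lu.hpp>

namespace ublas = boost::numeric::ublas;

// Solve A x = b; A is taken by value because lu_factorize overwrites it
// with its LU factors, and b is overwritten with the solution.
ublas::vector<double> solve6(ublas::matrix<double> A, ublas::vector<double> b)
{
    ublas::permutation_matrix<std::size_t> pm(A.size1());
    lu_factorize(A, pm);      // in-place LU with partial pivoting
    lu_substitute(A, pm, b);  // forward + backward substitution
    return b;
}
```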

GMan
or any other library that offers things like LU decomposition, forward/backward substitution (+1)
sellibitze
Is LU decomposition a computer-friendly method for solving things? I programmed one of those algorithms years ago on my TI-83 calculator while terribly bored in numerical methods class; as I recall, it used a lot of divisions, which are not very computer friendly. I'll take a look at it again; maybe I can derive some general equations to hard-code into the program.
Faken
@aaa: the "form" of my equations is known and does not change; just the values are different (there are also zeros and ones in the matrix when it is put into a form usable by Cramer's rule). uBLAS won't take that into consideration, would it?
Faken
@aaa: While it's generic, it should optimize quite heavily. Of course Faken will just have to try it and profile to see which method is faster.
GMan
@GMan: Ahh, no profiler in my VS2008, and I don't know how to use one yet as I've never had to (at least not yet, anyway). The best I can do is manually time individual functions...
Faken
Sure, hard to predict beforehand. @Faken: if you do decide to try uBLAS, it may be worthwhile to use bounded_matrix since it gives compile-time dimensions.
aaa
A: 

You could get at least a doubling by using SSE2 or higher. But that would pale in comparison to a CUDA or OpenCL port, which, if done right, could yield a speed-up of one or two orders of magnitude.

If you know Python, PyCUDA might be a good entry point.

Marcelo Cantos
I'm currently using VS2008 on a Core i7 processor; would SSE2 already be enabled by default? If not, how can I enable it? Also, conceptually, what would be the best way to go about implementing CUDA at a very high level (such as one thread to generate values to be calculated, a thread to handle loading and retrieving data from CUDA, and one for processing the results, etc.)?
Faken
Definitely. SSE2 was introduced almost ten years ago. The Core i7 architecture supports [SSE4.2](http://en.wikipedia.org/wiki/SSE4). I can't help you on CUDA, I'm afraid; I haven't played around with it much.
Marcelo Cantos
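For what it's worth, here is a tiny sketch of what hand-written SSE2 looks like (assuming the <emmintrin.h> intrinsics; note that in a 32-bit VS2008 build the /arch:SSE2 option under C/C++ → Code Generation also lets the compiler use SSE2 on its own, and x64 builds use SSE2 by default):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// y[i] += a * x[i] for n doubles, processed two at a time.
// Assumes n is even and x, y are 16-byte aligned (for the aligned load/store).
void axpy_sse2(double a, const double* x, double* y, int n)
{
    __m128d va = _mm_set1_pd(a);
    for (int i = 0; i < n; i += 2) {
        __m128d vx = _mm_load_pd(x + i);
        __m128d vy = _mm_load_pd(y + i);
        vy = _mm_add_pd(vy, _mm_mul_pd(va, vx));
        _mm_store_pd(y + i, vy);
    }
}
```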
+3  A: 

Cramer's rule doesn't scale well. For small systems of equations with two or three unknowns it's fine, but as the system gets larger, other methods are more efficient, for example LU decomposition + forward substitution + backward substitution.
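
For a rough sense of scale (standard operation counts, assuming a naive cofactor expansion for the determinants; hard-coded formulas with shared subexpressions will be smaller, but the gap is the point):

```latex
\text{Cramer: } (n+1)\,n!\,(n-1) \approx 25{,}000 \text{ multiplications for } n = 6
\qquad
\text{LU: } \tfrac{2}{3}n^3 + 2n^2 \approx 216 \text{ flops for } n = 6,
\text{ with re-solves against the same factorization costing only the } 2n^2 \text{ substitution part.}
```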

sellibitze
Yea, I know... the equations are huge. I'll take a look at LU. In your opinion, should I look at using a general library to solve them, or should I just stick to trying to find more efficient approaches in terms of mathematical methods, since I know exactly the form of the system of equations?
Faken
LU works well if you are re-solving a system Ab = y multiple times. The first run is expensive; subsequent runs are fast.
joel3000
+1  A: 

If you want to run CUDA, you need a decent Nvidia graphics card.

If you have an Intel CPU, I recommend you use Intel's MKL (http://software.intel.com/en-us/intel-mkl/), which is optimized for Intel CPUs.

If you use CUDA, you might run into float vs. double precision issues.

Plus, if you are not familiar with GPU programming, you are going to spend more time on the CUDA solution.
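
For illustration, a minimal sketch of solving one 6x6 system through MKL's LAPACK layer (assuming the LAPACKE C interface from mkl_lapacke.h; dgesv does the LU factorization and substitution in one call):

```cpp
#include <mkl_lapacke.h>  // MKL's LAPACKE C interface (assumed available)

// Solve A x = b for a 6x6 system.
// a: row-major 6x6 coefficients (overwritten by the LU factors)
// b: right-hand side on entry, solution on return
// Returns 0 on success, nonzero if the matrix is singular.
int solve6(double* a, double* b)
{
    lapack_int ipiv[6];  // pivot indices produced by the factorization
    return LAPACKE_dgesv(LAPACK_ROW_MAJOR, 6, 1, a, 6, ipiv, b, 1);
}
```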

shader
Aww... $400+? Yea, that's out of my operating budget. Maybe my university has some licences, though that would rule out being able to work from home. I'm aware of the single/double problem with CUDA; currently I'm using double simply because I take no speed penalties from it. However, if I use Cramer's rule I should be able to get away with single precision because of the lack of divisions.
Faken
@Faken: actually, addition and subtraction are the main sources of error; multiplication and division not so much.
aaa
@shader: Heh, that's dirty...
Faken
shader
I think single precision should be good enough for me actually. I'm dealing with a physical system in which my control error is already way beyond error from single precision math.
Faken
What is your graphics card model?
shader
@shader: My home computer has an 8800GTX, the university computer a 310 GT... which is a totally useless card (yea, it's actually discrete too! Even more sad is the fact it's paired up with a Core i7 860 processor). Both cards are CUDA capable; if I can prove a small speed advantage on the 310 GT, I should be able to convince my prof to get me a card with some real horsepower.
Faken
@Faken, you can start with the 8800GTX; it's a bit faster. In my own experience, the GTX480 is 8 times faster than the 9800GTX. For now, you can either try to write your own CUDA application or use some libs: http://gpgpu.org/tag/linear-algebra
shader
+3  A: 

Perhaps you could give http://arma.sourceforge.net/docs.html a try.

It provides a premade solve function, http://arma.sourceforge.net/docs.html#solve. However, it uses an ATLAS/LAPACK backend, which is geared more towards larger matrices.

You can also try multiplication by the inverse, http://arma.sourceforge.net/docs.html#inv, which is a compile-time template and may be faster for your purposes.

Try this: x = inv(A)*b. Since A does not change, the inversion is done only once. Then you are home free with simple matrix-vector multiplications, which will be really fast.
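
A minimal sketch of both approaches with Armadillo (assuming <armadillo>; the fixed-size types make the 6x6 dimensions known at compile time):

```cpp
#include <armadillo>

int main()
{
    arma::mat::fixed<6,6> A;  // coefficient matrix (fill with your values)
    arma::vec::fixed<6>   b;  // right-hand side
    A.randu();  b.randu();    // placeholder data for the sketch

    // Direct solve (LAPACK-backed):
    arma::vec x1 = arma::solve(A, b);

    // Invert once, then reuse with cheap matrix-vector products:
    arma::mat Ainv = arma::inv(A);
    arma::vec x2 = Ainv * b;

    return 0;
}
```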

aaa
@GMan: removed it. On another note, I am working on a CUBLAS/uBLAS bridge and I'm looking for partners. Do you know someone (or yourself) who may be interested? I think I saw somewhere that you mentioned using CUDA.
aaa
@aaa: Do you mean allow Boost's uBLAS (or some other BLAS library) to take advantage of CUBLAS? I'm interested, per se, but I don't have time to work on anything else. :( I have indeed used CUDA, but only to poke around, I've never done anything serious with it.
GMan
@GMan: sure, no problem. I thought you may have known somebody. It's more like having uBLAS expressions, but in GPU memory, using CUBLAS kernels and a few functions to communicate. Here is a small test case: http://code.google.com/p/asadchev/source/browse/trunk/projects/boost/numeric/bindings/cublas/test.cpp
aaa
@aaa: That's a pretty awesome little project you've got there. :) I hope you can find someone to help finish it.
GMan
The invert method is likely really the best solution short of going to CUDA (which I'll explore further if this method is insufficient). Doing some clever substitutions with MATLAB, we can whittle down the equations to something quite reasonable. I'll still need to calculate the inverse many times, but at least it's no longer running on the iteration side of the program. The downside is that there are now many little equations running around everywhere. Oh well, thanks for the ideas!
Faken
@GMan: thanks. There is other stuff there as well, unfortunately not yet documented, which I wrote to support my main application. Feel free to poke around if you want; I am desperate for feedback.
aaa
A: 

Unless you can solve your equations in non-sequential order, CUDA won't help. In fact, CUDA may be slower. Anything not embarrassingly parallel won't benefit from CUDA. Enabling SSE2 via a compiler switch isn't enough; you need a library that is coded to use SSE2. For my taste, the best linear algebra library is Eigen. It is very easy to use and supports SIMD (not only SSE2).
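
For example, a minimal Eigen sketch using a fixed-size 6x6 matrix (fixed sizes let Eigen unroll and vectorize the solve):

```cpp
#include <Eigen/Dense>

int main()
{
    Eigen::Matrix<double, 6, 6> A;  // fill with your coefficients
    Eigen::Matrix<double, 6, 1> b;  // right-hand side
    A.setRandom();  b.setRandom();  // placeholder data for the sketch

    // LU factorization with partial pivoting; the factorization object can
    // be reused to solve for additional right-hand sides with the same A.
    Eigen::PartialPivLU<Eigen::Matrix<double, 6, 6> > lu(A);
    Eigen::Matrix<double, 6, 1> x = lu.solve(b);

    return 0;
}
```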

What do you mean by solving the equations in non-sequential order? I know now that this problem is actually a multi-variable optimization problem (learning stuff as I go; this project has a habit of doing that to me). However, it's an optimization problem within an optimization problem, thus I can parallelize the individual optimization problems. If I get the main CPU to set up the problems, would CUDA be able to adjust parameters itself and do iterations without being explicitly fed data by the CPU? As opposed to the CPU prepping a matrix and CUDA solving it and simply returning it, nothing else?
Faken
Think of the GPU as a multicore processor. The GPU has many cores, but each core is much weaker than a CPU core. The GPU gets its strength from parallelization. Can you split the whole problem into sub-tasks which can be executed independently? Say you have equations 1, 2, ..., N. Can you solve them independently? If so, CUDA may help. You could try to parallelize your code on the CPU first, because doing the same with CUDA is harder. In my experience, linear algebra is notoriously hard to parallelize, unless of course the problem consists of independent sub-tasks.
CUDA is a programming language similar to C. GPU coding involves explicit memory management, and you must be very careful about moving data around. Solving one 6x6 system and returning the answer to the CPU doesn't justify the overhead. In order to benefit from the GPU, your algorithm must be able to stream a very large number of equations at once, with no dependencies between the streamed equations. Then the GPU can solve all these equations faster than the CPU.
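
As a rough illustration of the "independent sub-tasks" point, here is a sketch of batching many independent 6x6 solves on the CPU first (assuming OpenMP and Eigen; a GPU port would need the same batched, dependency-free structure):

```cpp
#include <vector>
#include <Eigen/Dense>
#include <Eigen/StdVector>  // aligned_allocator for fixed-size Eigen types

typedef Eigen::Matrix<double, 6, 6> Mat6;
typedef Eigen::Matrix<double, 6, 1> Vec6;
typedef std::vector<Mat6, Eigen::aligned_allocator<Mat6> > Mat6List;
typedef std::vector<Vec6, Eigen::aligned_allocator<Vec6> > Vec6List;

// Solve many unrelated 6x6 systems. Each iteration is independent of the
// others, so the loop parallelizes trivially across CPU cores.
// x must be pre-sized to A.size().
void solve_batch(const Mat6List& A, const Vec6List& b, Vec6List& x)
{
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(A.size()); ++i)
        x[i] = A[i].partialPivLu().solve(b[i]);
}
```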