tags:

views: 1757

answers: 9

I am interested to know whether anyone has written an application that takes advantage of a GPGPU by using, for example, nVidia CUDA. If so, what issues did you find and what performance gains did you achieve compared with a standard CPU?

+6  A: 

I have been using GPGPU for motion detection (originally using Cg and now CUDA) and for image stabilization (using CUDA) as part of image-processing work. I've been getting about a 10-20x speedup in these situations.

From what I've read, this is fairly typical for data-parallel algorithms.

rck
+10  A: 

I have written trivial applications; it really helps if you can parallelize floating-point calculations.

I found the following course, co-taught by a University of Illinois Urbana-Champaign professor and an NVIDIA engineer, very useful when I was getting started: http://courses.ece.illinois.edu/ece498/al/Archive/Spring2007/Syllabus.html (includes recordings of all lectures).

Tony BenBrahim
Great link - thanks.
Greg Whitfield
Definitely a very good link but it appears to be broken. Try http://courses.ece.illinois.edu/ece498/al/Archive/Spring2007/Syllabus.html instead.
Tim Whitcomb
The link seems to have expired
ccook
+1  A: 

While I haven't got any practical experience with CUDA yet, I have been studying the subject and found a number of papers which document positive results using GPGPU APIs (all of them include CUDA).

This paper describes how database joins can be parallelized by creating a number of parallel primitives (map, scatter, gather, etc.) which can be combined into an efficient algorithm.
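
For a rough idea of what such primitives look like, here is a minimal CUDA sketch of gather and scatter (my own illustration, not code from the paper; names and types are placeholders):

    // Hypothetical gather/scatter primitives of the kind the join paper combines.
    __global__ void gather(const int *in, const int *idx, int *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[idx[i]];      // each thread reads one indirected element
    }

    __global__ void scatter(const int *in, const int *idx, int *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[idx[i]] = in[i];      // each thread writes to one indirected position
    }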

In this paper, a parallel implementation of the AES encryption standard achieves speed comparable to dedicated encryption hardware.

Finally, this paper analyses how well CUDA applies to a number of applications such as structured and unstructured grids, combinational logic, dynamic programming and data mining.

Morten Christiansen
+3  A: 

I've implemented a Monte Carlo calculation in CUDA for some financial use. The optimised CUDA code is about 500x faster than a "could have tried harder, but not really" multi-threaded CPU implementation. (Comparing a GeForce 8800GT to a Q6600 here.) It is well known that Monte Carlo problems are embarrassingly parallel, though.

The major issues encountered involve the loss of precision due to the G8x and G9x chips' limitation to IEEE single-precision floating-point numbers. With the release of the GT200 chips this can be mitigated to some extent by using the double-precision unit, at the cost of some performance. I haven't tried it out yet.

Also, since CUDA is a C extension, integrating it into another application can be non-trivial.
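
To make both points concrete, here is a minimal sketch of the overall shape such code can take -- an embarrassingly parallel Monte Carlo kernel exposed through an extern "C" entry point so an ordinary C/C++ application can link against the .cu file. This is only an illustration (a pi estimator with a toy PRNG), not the financial code described above, and all names are placeholders:

    // monte_carlo.cu -- hypothetical sketch; compile with nvcc and link into the host app
    #include <cuda_runtime.h>

    __device__ unsigned int xorshift(unsigned int &s)    // tiny per-thread PRNG
    {
        s ^= s << 13; s ^= s >> 17; s ^= s << 5;
        return s;
    }

    // Each thread runs its own independent trials -- the embarrassingly parallel part.
    __global__ void mc_pi_kernel(unsigned int seed, int trials, unsigned int *hits)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int s = seed ^ (tid * 2654435761u);      // decorrelate the threads
        unsigned int count = 0;
        for (int i = 0; i < trials; ++i) {
            float x = xorshift(s) * (1.0f / 4294967296.0f);
            float y = xorshift(s) * (1.0f / 4294967296.0f);
            if (x * x + y * y <= 1.0f) ++count;
        }
        hits[tid] = count;
    }

    // Plain C entry point so a C or C++ application can link against this file.
    extern "C" float mc_estimate_pi(int threads, int trials_per_thread)
    {
        unsigned int *d_hits;                             // assumes threads is a multiple of 256
        cudaMalloc(&d_hits, threads * sizeof(unsigned int));
        mc_pi_kernel<<<threads / 256, 256>>>(12345u, trials_per_thread, d_hits);

        unsigned int *h_hits = new unsigned int[threads];
        cudaMemcpy(h_hits, d_hits, threads * sizeof(unsigned int), cudaMemcpyDeviceToHost);

        unsigned long long total = 0;
        for (int i = 0; i < threads; ++i) total += h_hits[i];
        delete[] h_hits;
        cudaFree(d_hits);
        return 4.0f * total / (float(threads) * trials_per_thread);
    }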

biozinc
I don't think it's difficult at all to integrate. After all, with a simple 'extern "C"' it's just standard C-style linkage. Almost anything should be able to deal with that. It took less than 20 minutes to get a quick test running linked to a Qt app (after getting the SDK installed and the samples compiled, anyway). If you mean from managed code... well... IMHO, square peg, round hole.
darron
+22  A: 

I have been doing GPGPU development with ATI's Stream SDK instead of CUDA. What kind of performance gain you will get depends on a lot of factors, but the most important is the numeric intensity. (That is, the ratio of compute operations to memory references.)

A BLAS level-1 or BLAS level-2 function like adding two vectors does only 1 math operation for every 3 memory references, so the NI is (1/3). This will always run slower with CAL or CUDA than just doing it on the CPU. The main reason is the time it takes to transfer the data from the CPU to the GPU and back.
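
For concreteness, the vector-add case looks like this in CUDA (the same argument applies to CAL/Brook+); each thread performs one add against two loads and one store, so the kernel is memory-bound even before counting the transfers over the bus:

    // Minimal sketch of the NI = 1/3 case: one add per three memory references.
    __global__ void vector_add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];       // 2 loads + 1 store for a single flop
    }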

For a function like FFT, there are O(N log N) computations and O(N) memory references, so the NI is O(log N). If N is very large, say 1,000,000, it will likely be faster to do it on the GPU; if N is small, say 1,000, it will almost certainly be slower.
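
On the CUDA side, the GPU FFT is typically a single library call; a minimal CUFFT sketch (error checking omitted, data assumed to be on the device already) looks like this, and the fixed plan-creation and transfer overhead is exactly why small N loses:

    #include <cufft.h>

    // Hypothetical sketch: in-place forward complex-to-complex FFT of n points
    // that are already resident on the device.
    void gpu_fft(cufftComplex *d_data, int n)
    {
        cufftHandle plan;
        cufftPlan1d(&plan, n, CUFFT_C2C, 1);                  // plan a single 1D C2C transform
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);    // execute it in place
        cufftDestroy(plan);
    }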

For a BLAS level-3 or LAPACK function like LU decomposition of a matrix, or finding its eigenvalues, there are O(N^3) computations and O(N^2) memory references, so the NI is O(N). For very small arrays, say N of a few score, this will still be faster to do on the CPU, but as N increases, the algorithm very quickly goes from memory-bound to compute-bound and the performance increase on the GPU rises very quickly.

Anything involving complex arithmetic has more computations than scalar arithmetic, which usually doubles the NI and increases GPU performance.

Here is the performance of CGEMM -- complex single precision matrix-matrix multiplication done on a Radeon 4870.

Die in Sente
+7  A: 

I have used CUDA for several image processing algorithms. These applications, of course, are very well suited for CUDA (or any GPU processing paradigm).

IMO, there are three typical stages when porting an algorithm to CUDA:

  1. Initial Porting: Even with a very basic knowledge of CUDA, you can port simple algorithms within a few hours. If you are lucky, you gain a factor of 2 to 10 in performance.
  2. Trivial Optimizations: This includes using textures for input data and padding of multi-dimensional arrays. If you are experienced, this can be done within a day and might give you another factor of 10 in performance. The resulting code is still readable.
  3. Hardcore Optimizations: This includes copying data to shared memory to avoid global memory latency (see the sketch below), turning the code inside out to reduce the number of registers used, etc. You can spend several weeks on this step, but the performance gain is not really worth it in most cases. After this step, your code will be so obfuscated that nobody understands it (including you).
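
As an illustration of the kind of rewrite step 3 involves, here is a generic shared-memory tiling sketch for a matrix multiply (not taken from the image-processing code above); each tile of the inputs is staged in shared memory so that every global-memory load is reused TILE times:

    #define TILE 16

    // Generic "step 3" style optimization: stage TILE x TILE blocks of A and B in
    // shared memory so each value fetched from global memory is reused TILE times.
    // Assumes n is a multiple of TILE and the grid is (n/TILE, n/TILE) blocks of
    // (TILE, TILE) threads.
    __global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
    {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < n / TILE; ++t) {
            As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();                              // wait until the tile is loaded

            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                              // wait before overwriting the tile
        }
        C[row * n + col] = acc;
    }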

This is very similar to optimizing code for CPUs. However, the response of a GPU to performance optimizations is even less predictable than it is for CPUs.

Tom Pohl
I agree with those steps. However, the first step needs special care: if you use too many conditionals or loops in kernels, and don't analyze your problem first (whether it is well suited to run in parallel in that implementation), you will be out of luck.
Edison Gustavo Muenz
This also mimics my experience. I knew my problem (n-body) would parallelise. A trivial implementation didn't take that long to implement and was 10x faster than a C++ implementation on an i7 920 CPU. Rethinking the algorithm to make better use of shared memory etc. has already given me a 3-4x speedup, but required a lot more thought.
Ade Miller
+1 for "nobody understands it (including you)".
Andrew Grimm
+1  A: 

Yes. I have implemented the Nonlinear Anisotropic Diffusion Filter using the CUDA API.

It is fairly easy, since it's a filter that must be run in parallel given an input image. I didn't encounter many difficulties with this, since it just required a simple kernel. The speedup was about 300x. This was my final project in CS. The project can be found here (it's written in Portuguese, though).
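
For readers unfamiliar with this kind of filter, a heavily simplified sketch of one per-pixel update step is shown below -- this uses a plain isotropic (Laplacian) neighbourhood rather than the full anisotropic conductance weighting, and is not the project's actual code:

    // Heavily simplified per-pixel update: each thread updates one interior pixel
    // from its 4-neighbourhood. The real anisotropic filter would weight each
    // direction by a conductance term instead of using a plain Laplacian.
    __global__ void diffusion_step(const float *in, float *out,
                                   int w, int h, float lambda)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x <= 0 || y <= 0 || x >= w - 1 || y >= h - 1)
            return;                                       // leave the border untouched

        float c = in[y * w + x];
        float laplacian = in[y * w + x - 1] + in[y * w + x + 1]
                        + in[(y - 1) * w + x] + in[(y + 1) * w + x] - 4.0f * c;
        out[y * w + x] = c + lambda * laplacian;
    }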

I tried writing the Mumford & Shah segmentation algorithm too, but that has been a pain to write, since CUDA is still in its early days and lots of strange things happen. I have even seen a performance improvement from adding an if (false) {} to the code O_O.

The results for this segmentation algorithm weren't good. I had a performance loss of 20x compared to a CPU approach (however, since it's a CPU, a different approach that yielded the same results could be taken). It's still a work in progress, but unfortunately I left the lab I was working in, so maybe someday I might finish it.

Edison Gustavo Muenz
A: 

I implemented a Genetic Algorithm on the GPU and got speedups of around 7x. More gains are possible with a higher numeric intensity, as someone else pointed out. So yes, the gains are there, if the application is right.
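
The usual mapping is one thread per individual for the fitness evaluation, which is where the parallelism comes from; a minimal sketch (the fitness function and encoding below are placeholders, not the implementation described above):

    // Hypothetical sketch: evaluate the fitness of an entire population in parallel,
    // one thread per individual. The "sum of squares" fitness is only a stand-in.
    __global__ void evaluate_population(const float *genomes, float *fitness,
                                        int pop_size, int genome_len)
    {
        int ind = blockIdx.x * blockDim.x + threadIdx.x;
        if (ind >= pop_size)
            return;

        float score = 0.0f;
        for (int g = 0; g < genome_len; ++g) {
            float v = genomes[ind * genome_len + g];
            score += v * v;
        }
        fitness[ind] = score;
    }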

zenna
A: 
GG
What are the X and Y axes?
Andrew Grimm
problem size and speedup :)
GG