tags:

views: 1757

answers: 9

I am interested to know whether anyone has written an application that takes advantage of a GPGPU by using, for example, nVidia CUDA. If so, what issues did you find and what performance gains did you achieve compared with a standard CPU?

+6  A: 

I have been using GPGPU for motion detection (originally using Cg and now CUDA) and for image stabilization (using CUDA) as part of image-processing work. I've been getting about a 10-20x speedup in these situations.

From what I've read, this is fairly typical for data-parallel algorithms.

rck
+10  A: 

I have written trivial applications; it really helps if you can parallelize floating-point calculations.

I found the following course, co-taught by a University of Illinois Urbana-Champaign professor and an NVIDIA engineer, very useful when I was getting started: http://courses.ece.illinois.edu/ece498/al/Archive/Spring2007/Syllabus.html (includes recordings of all lectures).

Tony BenBrahim
Great link - thanks.
Greg Whitfield
Definitely a very good link but it appears to be broken. Try http://courses.ece.illinois.edu/ece498/al/Archive/Spring2007/Syllabus.html instead.
Tim Whitcomb
The link seems to have expired
ccook
+1  A: 

While I haven't got any practical experience with CUDA yet, I have been studying the subject and found a number of papers which document positive results using GPGPU APIs (all of them include CUDA).

This paper describes how database joins can be parallelized by creating a number of parallel primitives (map, scatter, gather, etc.) which can be combined into an efficient algorithm.
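
For a rough idea of what such primitives look like, here is a minimal CUDA sketch of gather and scatter (my own illustration, not code from the paper; names and types are placeholders):

    // Hypothetical gather/scatter primitives of the kind the join paper combines.
    __global__ void gather(const int *in, const int *idx, int *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[idx[i]];      // each thread reads one indirected element
    }

    __global__ void scatter(const int *in, const int *idx, int *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[idx[i]] = in[i];      // each thread writes to one indirected position
    }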

In this paper, a parallel implementation of the AES encryption standard achieves speed comparable to dedicated encryption hardware.

Finally, this paper analyses how well CUDA applies to a number of applications such as structured and unstructured grids, combinational logic, dynamic programming and data mining.

Morten Christiansen
+3  A: 

I've implemented a Monte Carlo calculation in CUDA for some financial use. The optimised CUDA code is about 500x faster than a "could have tried harder, but not really" multi-threaded CPU implementation. (Comparing a GeForce 8800GT to a Q6600 here.) It is well known that Monte Carlo problems are embarrassingly parallel, though.

The major issues encountered involve the loss of precision due to the G8x and G9x chips' limitation to IEEE single-precision floating-point numbers. With the release of the GT200 chips this can be mitigated to some extent by using the double-precision unit, at the cost of some performance. I haven't tried it out yet.

Also, since CUDA is a C extension, integrating it into another application can be non-trivial.
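
To make both points concrete, here is a minimal sketch of the overall shape such code can take -- an embarrassingly parallel Monte Carlo kernel exposed through an extern "C" entry point so an ordinary C/C++ application can link against the .cu file. This is only an illustration (a pi estimator with a toy PRNG), not the financial code described above, and all names are placeholders:

    // monte_carlo.cu -- hypothetical sketch; compile with nvcc and link into the host app
    #include <cuda_runtime.h>

    __device__ unsigned int xorshift(unsigned int &s)    // tiny per-thread PRNG
    {
        s ^= s << 13; s ^= s >> 17; s ^= s << 5;
        return s;
    }

    // Each thread runs its own independent trials -- the embarrassingly parallel part.
    __global__ void mc_pi_kernel(unsigned int seed, int trials, unsigned int *hits)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int s = seed ^ (tid * 2654435761u);      // decorrelate the threads
        unsigned int count = 0;
        for (int i = 0; i < trials; ++i) {
            float x = xorshift(s) * (1.0f / 4294967296.0f);
            float y = xorshift(s) * (1.0f / 4294967296.0f);
            if (x * x + y * y <= 1.0f) ++count;
        }
        hits[tid] = count;
    }

    // Plain C entry point so a C or C++ application can link against this file.
    extern "C" float mc_estimate_pi(int threads, int trials_per_thread)
    {
        unsigned int *d_hits;                             // assumes threads is a multiple of 256
        cudaMalloc(&d_hits, threads * sizeof(unsigned int));
        mc_pi_kernel<<<threads / 256, 256>>>(12345u, trials_per_thread, d_hits);

        unsigned int *h_hits = new unsigned int[threads];
        cudaMemcpy(h_hits, d_hits, threads * sizeof(unsigned int), cudaMemcpyDeviceToHost);

        unsigned long long total = 0;
        for (int i = 0; i < threads; ++i) total += h_hits[i];
        delete[] h_hits;
        cudaFree(d_hits);
        return 4.0f * total / (float(threads) * trials_per_thread);
    }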

biozinc
I don't think it's difficult at all to integrate. After all, with a simple 'extern "C"' it's just standard C-style linkage. Almost anything should be able to deal with that. It took less than 20 minutes to get a quick test running linked to a Qt app (after getting the SDK installed and the samples compiled, anyway). If you mean from managed code... well... IMHO, square peg, round hole.
darron
+22  A: 

I have been doing GPGPU development with ATI's Stream SDK instead of CUDA. What kind of performance gain you will get depends on a lot of factors, but the most important is the numeric intensity. (That is, the ratio of compute operations to memory references.)

A BLAS level-1 or BLAS level-2 function like adding two vectors does only 1 math operation for every 3 memory references, so the NI is (1/3). This will always run slower with CAL or CUDA than just doing it on the CPU. The main reason is the time it takes to transfer the data from the CPU to the GPU and back.
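
For concreteness, the vector-add case looks like this in CUDA (the same argument applies to CAL/Brook+); each thread performs one add against two loads and one store, so the kernel is memory-bound even before counting the transfers over the bus:

    // Minimal sketch of the NI = 1/3 case: one add per three memory references.
    __global__ void vector_add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];       // 2 loads + 1 store for a single flop
    }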

For a function like FFT, there are O(N log N) computations and O(N) memory references, so the NI is O(log N). If N is very large, say 1,000,000, it will likely be faster to do it on the GPU; if N is small, say 1,000, it will almost certainly be slower.
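
On the CUDA side, the GPU FFT is typically a single library call; a minimal CUFFT sketch (error checking omitted, data assumed to be on the device already) looks like this, and the fixed plan-creation and transfer overhead is exactly why small N loses:

    #include <cufft.h>

    // Hypothetical sketch: in-place forward complex-to-complex FFT of n points
    // that are already resident on the device.
    void gpu_fft(cufftComplex *d_data, int n)
    {
        cufftHandle plan;
        cufftPlan1d(&plan, n, CUFFT_C2C, 1);                  // plan a single 1D C2C transform
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);    // execute it in place
        cufftDestroy(plan);
    }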

For a BLAS level-3 or LAPACK function like LU decomposition of a matrix, or finding its eigenvalues, there are O(N^3) computations and O(N^2) memory references, so the NI is O(N). For very small arrays, say N of a few score, this will still be faster to do on the CPU, but as N increases, the algorithm very quickly goes from memory-bound to compute-bound and the performance increase on the GPU rises very quickly.

Anything involving complex arithmetic has more computations than scalar arithmetic, which usually doubles the NI and increases GPU performance.

Here is the performance of CGEMM -- complex single precision matrix-matrix multiplication done on a Radeon 4870.

Die in Sente
+7  A: 

I have used CUDA for several image processing algorithms. These applications, of course, are very well suited for CUDA (or any GPU processing paradigm).

IMO, there are three typical stages when porting an algorithm to CUDA:

  1. Initial Porting: Even with a very basic knowledge of CUDA, you can port simple algorithms within a few hours. If you are lucky, you gain a factor of 2 to 10 in performance.
  2. Trivial Optimizations: This includes using textures for input data and padding of multi-dimensional arrays. If you are experienced, this can be done within a day and might give you another factor of 10 in performance. The resulting code is still readable.
  3. Hardcore Optimizations: This includes copying data to shared memory to avoid global memory latency (see the sketch below), turning the code inside out to reduce the number of registers used, etc. You can spend several weeks on this step, but the performance gain is not really worth it in most cases. After this step, your code will be so obfuscated that nobody understands it (including you).
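
As an illustration of the kind of rewrite step 3 involves, here is a generic shared-memory tiling sketch for a matrix multiply (not taken from the image-processing code above); each tile of the inputs is staged in shared memory so that every global-memory load is reused TILE times:

    #define TILE 16

    // Generic "step 3" style optimization: stage TILE x TILE blocks of A and B in
    // shared memory so each value fetched from global memory is reused TILE times.
    // Assumes n is a multiple of TILE and the grid is (n/TILE, n/TILE) blocks of
    // (TILE, TILE) threads.
    __global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
    {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < n / TILE; ++t) {
            As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();                              // wait until the tile is loaded

            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                              // wait before overwriting the tile
        }
        C[row * n + col] = acc;
    }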

This is very similar to optimizing code for CPUs. However, the response of a GPU to performance optimizations is even less predictable than it is for CPUs.

Tom Pohl
I agree with those steps. However, the first step needs special care: if you use too many conditionals or loops in kernels, and don't analyze your problem first (whether it is well suited to run in parallel in that implementation), you will be out of luck.
Edison Gustavo Muenz
This also mimics my experience. I knew my problem (n-body) would parallelise. A trivial implementation didn't take that long to implement and was 10x faster than a C++ implementation on an i7 920 CPU. Rethinking the algorithm to make better use of shared memory etc. has already given me a 3-4x speedup, but required a lot more thought.
Ade Miller
+1 for "nobody understands it (including you)".
Andrew Grimm
+1  A: 

Yes. I have implemented the Nonlinear Anisotropic Diffusion Filter using the CUDA API.

It is fairly easy, since it's a filter that must be run in parallel given an input image. I didn't encounter many difficulties with this, since it just required a simple kernel. The speedup was about 300x. This was my final project in CS. The project can be found here (it's written in Portuguese, though).
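
For readers unfamiliar with this kind of filter, a heavily simplified sketch of one per-pixel update step is shown below -- this uses a plain isotropic (Laplacian) neighbourhood rather than the full anisotropic conductance weighting, and is not the project's actual code:

    // Heavily simplified per-pixel update: each thread updates one interior pixel
    // from its 4-neighbourhood. The real anisotropic filter would weight each
    // direction by a conductance term instead of using a plain Laplacian.
    __global__ void diffusion_step(const float *in, float *out,
                                   int w, int h, float lambda)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x <= 0 || y <= 0 || x >= w - 1 || y >= h - 1)
            return;                                       // leave the border untouched

        float c = in[y * w + x];
        float laplacian = in[y * w + x - 1] + in[y * w + x + 1]
                        + in[(y - 1) * w + x] + in[(y + 1) * w + x] - 4.0f * c;
        out[y * w + x] = c + lambda * laplacian;
    }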

I tried writing the Mumford & Shah segmentation algorithm too, but that has been a pain to write, since CUDA is still in its early days and lots of strange things happen. I have even seen a performance improvement from adding an if (false) {} to the code O_O.

The results for this segmentation algorithm weren't good. I had a performance loss of 20x compared to a CPU approach (however, since it's a CPU, a different approach that yielded the same results could be taken). It's still a work in progress, but unfortunately I left the lab I was working in, so maybe someday I might finish it.

Edison Gustavo Muenz
A: 

I implemented a Genetic Algorithm on the GPU and got speedups of around 7x. More gains are possible with a higher numeric intensity, as someone else pointed out. So yes, the gains are there, if the application is right.
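
The usual mapping is one thread per individual for the fitness evaluation, which is where the parallelism comes from; a minimal sketch (the fitness function and encoding below are placeholders, not the implementation described above):

    // Hypothetical sketch: evaluate the fitness of an entire population in parallel,
    // one thread per individual. The "sum of squares" fitness is only a stand-in.
    __global__ void evaluate_population(const float *genomes, float *fitness,
                                        int pop_size, int genome_len)
    {
        int ind = blockIdx.x * blockDim.x + threadIdx.x;
        if (ind >= pop_size)
            return;

        float score = 0.0f;
        for (int g = 0; g < genome_len; ++g) {
            float v = genomes[ind * genome_len + g];
            score += v * v;
        }
        fitness[ind] = score;
    }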

zenna
A: 
GG
What are the X and Y axes?
Andrew Grimm
problem size and speedup :)
GG