views:

2086

answers:

15

So I finally took the time to learn CUDA and get it installed and configured on my computer and I have to say, I'm quite impressed!

Here's how it does at rendering the Mandelbrot set at 1280 x 678 pixels on my home PC with a Q6600 and a GeForce 8800GTS (max of 1000 iterations):

Maxing out all 4 CPU cores with OpenMP: 2.23 fps

Running the same algorithm on my GPU: 104.7 fps

And here's how fast I got it to render the whole set at 8192 x 8192 with a max of 1000 iterations:

Serial implementation on my home PC: 81.2 seconds

All 4 CPU cores on my home PC (OpenMP): 24.5 seconds

32 processors on my school's super computer (MPI with master-worker): 1.92 seconds

My home GPU (CUDA): 0.310 seconds

4 GPUs on my school's super computer (CUDA with static output decomposition): 0.0547 seconds
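
For context, the per-pixel work is just the classic escape-time loop; here's a simplified sketch of the kind of kernel involved (not my exact code, and the mapping parameters are only illustrative):

    // One thread per pixel; every pixel's iteration count is independent.
    __global__ void mandelbrot(int *out, int width, int height, int max_iter,
                               float x0, float y0, float dx, float dy)
    {
        int px = blockIdx.x * blockDim.x + threadIdx.x;
        int py = blockIdx.y * blockDim.y + threadIdx.y;
        if (px >= width || py >= height) return;

        float cr = x0 + px * dx;            // map pixel to the complex plane
        float ci = y0 + py * dy;
        float zr = 0.0f, zi = 0.0f;
        int iter = 0;
        while (iter < max_iter && zr * zr + zi * zi < 4.0f) {
            float t = zr * zr - zi * zi + cr;
            zi = 2.0f * zr * zi + ci;
            zr = t;
            iter++;
        }
        out[py * width + px] = iter;        // each pixel written independently
    }

It gets launched with one thread per pixel, e.g. dim3 block(16, 16); dim3 grid((width + 15) / 16, (height + 15) / 16);.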

So here's my question - if we can get such huge speedups by programming the GPU instead of the CPU, why is nobody doing it??? I can think of so many things we could speed up like this, and yet I don't know of many commercial apps that are actually doing it.

Also, what kinds of other speedups have you seen by offloading your computations to the GPU?

A: 

It's the difference in their designs. GPUs are built to push graphical calculations through quickly, since those all follow similar patterns - vector operations and the like. Try it the other way around - running general calculations through the GPU instead of the CPU - and you'd get the opposite effect.

AaronM
What is meant by general calculations? At the very bottom, it's just numbers and simple math, no?
masfenix
I literally mean general stuff, numbers and simple maths included. GPUs aren't as fast as CPUs at that, so for normal work they can't match up; it's like running a million-digit pi calculation on the computer you're using now versus one more than 5 years old.
AaronM
I suppose it depends how you define "fast." If you mean "millions of operations per second," GPUs blow CPUs away. If you mean, "millions of sequential operations per second," CPUs have quite an edge. Different solutions for different problems.
WhirlWind
True, but for the most part, in terms of what's being discussed here, sequential is the aim; otherwise the CPU and GPU wouldn't have been split, and we would just have had far more development of GPU-like CPUs. Obviously this is a massive generalisation, and there are plenty of times when doing things one after the other is of no benefit at all, but when we're talking about processors it's all about generalising what we need, so it should be a fair assumption to make for the discussion, so long as we're aware it's not always the case.
AaronM
Part of the reason I think sequential operations per second are so important is because it's just been the way we've been taught to code for years and years. CPUs used to get faster every year at performing sequential operations, so programming in parallel wasn't important. While some things probably are inherently sequential, I bet that if people started focusing more on making parallelizable code and used some cleverness, we'd find that there's a lot more we can do on a GPU than we thought.
Chris
@Chris: Except that people have been working on parallelism for a long time, particularly now that any serious computer has at least two cores. There are more or less specialized applications where parallel processing shines (like graphics), but it turns out to be a really hard problem in general.
David Thornley
+14  A: 

The GPU is not a general-purpose architecture. The GPU is heavily optimized and parallelized for certain very specific computations that are critical when it comes to rendering 3d graphics. It turns out that, sometimes, the same types of computations are very useful for other purposes as well. But most of the time, that just doesn't work.

Justice
I do realize that there are lots of things that won't work, but I still think there are tons of other speedups that we can get from it that we currently aren't. I think graphics are only one of the many things that can benefit from SIMD.
Chris
There are definitely speedups to be obtained from programming the GPU. For example, if you want to encrypt a few petabytes of data via AES in ECB mode, you can gain a lot by splitting the work across a few hundred GPU cores running in parallel. However, please note that this is not general-purpose computation. This is a particular problem which happens to be in the class of problems for which the GPU is extraordinarily useful.
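
Roughly, the decomposition looks like this (a sketch only - the per-block routine is a trivial stand-in, not real AES; the point is just that every 16-byte ECB block is independent):

    // Stand-in for a real AES-128 block encryption (the real thing would do
    // the usual rounds, typically with tables in shared/constant memory).
    // Kept trivial here so the sketch compiles; the parallel shape is the point.
    __device__ void encrypt_block_stub(const unsigned char *in,
                                       unsigned char *out,
                                       const unsigned char *key)
    {
        for (int b = 0; b < 16; ++b)
            out[b] = in[b] ^ key[b];        // placeholder, NOT real AES
    }

    // ECB mode: ciphertext block i depends only on plaintext block i,
    // so one thread can handle one 16-byte block with no coordination at all.
    __global__ void ecb_encrypt(const unsigned char *plaintext,
                                unsigned char *ciphertext,
                                const unsigned char *key,
                                size_t num_blocks)
    {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < num_blocks)
            encrypt_block_stub(plaintext + 16 * i, ciphertext + 16 * i, key);
    }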
Justice
+47  A: 

Many problems are not vectorizable, and not particularly adaptable to GPUs. In addition, few problems are bottlenecked by the CPU itself: they are limited by I/O bandwidth or some other factor. Finally, transferring data between the host and the GPU and back involves substantial latency, and this can become the bottleneck for problems that require a lot of GPU-host interaction.
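
To put a rough number on the transfer cost, this is the kind of measurement I mean (an illustrative sketch; the buffer size is arbitrary):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    int main()
    {
        const size_t n = 64 << 20;                      // 64M floats = 256 MB
        float *h = (float *)malloc(n * sizeof(float));
        float *d = NULL;
        cudaMalloc((void **)&d, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Time a host -> device -> host round trip over the PCIe bus.
        cudaEventRecord(start, 0);
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("round trip: %.1f ms\n", ms);            // often longer than the kernel itself

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d);
        free(h);
        return 0;
    }

If the kernel's own runtime is small compared to that figure, the GPU isn't going to help no matter how many cores it has. (Pinned host memory via cudaMallocHost speeds the copies up, but doesn't make them free.)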

WhirlWind
I agree about the bottleneck transferring memory back and forth. That seems like one of the bigger things holding it back.
Chris
+1 for mentioning i/o bottlenecks.
pajton
I think it will come, though. There will be more and more things that people figure out. CUDA's slow adoption sort of reminds me of when XMLHttpRequest went into browsers; it took several years, but the effects were huge when AJAX came out of it. Another factor is that CPUs have been "good enough" for a long time, but they are increasingly hitting their limits, and we are seeing them evolve into more and more cores rather than higher speeds; this will make programming GPUs more like programming CPUs than was previously the case.
WhirlWind
Another reason we aren't doing it: most programmers don't know how (yet).
WhirlWind
The biggest stopper is that most problems are not vectorizable by nature. To be vectorizable, they'd have to be decomposable into many small independent sub-tasks, and independent is the key word here. Memory performance would not be such an issue with non-linear tasks, where a big increase in calculation performance would pay off for the linear delay of data transfer.
n-alexander
@n-alexander That's at least partially true, though most real-world problems are not entirely scalar or vector problems; they fall somewhere in between. There are exceptions, of course
WhirlWind
@WhirlWind: The lack of knowledge is partly because the technique is of limited applicability. It's a lot to learn (particularly because it isn't standardized yet) if you don't have a good reason to do it. Hence, while a lot of people are working on this, most simply don't have enough use for GPU programming to make it worthwhile to learn.
David Thornley
+7  A: 

Compatibility is another point: CUDA works only on nVidia hardware, and the other popular vendors have their own technologies. OpenCL is supposed to help with this problem, but it is still a new standard.

Very nice question though - it's quite inspiring. Thanks for the test results.

Chris O
Actually, I heard some guy got CUDA to run on an ATI card with his own bridge.
Xavier Ho
+4  A: 

Try running CUDA on the hardware of the graphics chip market leader - Intel. I don't even know if CUDA works on a GMA945, but if it does then the performance would be interesting.

Edit: Just to be clear, my posting was a long way of saying "CUDA only runs on a fraction of all PCs, so it's simply not feasible to replace CPUs, which run in all x86 PCs"

Michael Stum
AFAIK, Intel doesn't support CUDA (nobody does except for NVIDIA). OpenCL is the open equivalent, and is supported by NVIDIA and AMD already. I suspect that Intel may support DirectX11's DirectCompute on Windows with future hardware.
Zifre
+42  A: 

Compatibility and portability are an issue. Not everyone has a beefy GPU. Everyone can be counted on to have a GHz+ CPU, so you can rely on that being able to do a decent amount of work; the variation in GPU performance is huge by comparison, from anemic Intel integrated graphics to the latest SLI/Crossfire'd powerhouses from ATI and NVidia. So your performance improvement just won't be an improvement on all computers. Some systems will run the software slower, if at all, because they just don't have the GPU power needed.

And of course, as others have mentioned, not every GPU vendor supports the same APIs. NVidia has done amazing things with CUDA from what I've seen, but no one else supports it. ATI and NVidia both support OpenCL, but Intel doesn't, as far as I know.

There's no API that everyone supports and which you can rely on being supported. So which API do you target? How do you make your app run on all your customers' computers? If you make GPU support an optional extra, it's additional work for you. And if you require GPU support, you cut off a large number of your customers.

Furthermore, not all tasks are suited for running on a GPU. The GPU is very specialized for parallelizable number-crunching. It doesn't speed up I/O-bound programs (which account for most of the sluggishness we see in our everyday computer usage), and since it doesn't have direct access to system memory, you also pay additional latency transferring data between RAM and GPU memory. In some cases this is insignificant; in others, it might make the whole exercise pointless.

And finally, of course, there's inertia. Large established software can't just be ripped up and ported to run on a GPU. There's often a huge fear of breaking things when working with existing codebases, so dramatic rewrites like this tend to be approached very carefully. And hey, if we've spent the last 10 years making our software run as well as possible on a CPU, it's probably going to take some convincing before we'll believe it could run better on a GPU. Not because it's not true, but because people are basically conservative, believe their own way is best, and dislike change.

jalf
Not "the last 10 years", the last **50** for a number of programs.
Donal Fellows
+10  A: 

GPUs are a temporary technology, like maths co-processors a couple of decades ago, like video cards a decade ago. If the stuff that these add-ons provide is general purpose enough and works well enough then that stuff will be on the next generation of CPUs. This is already happening with Intel promising us 80 cores on new CPUs some time soon.

If you have a code with, say, 10^6 lines of Fortran or C++, then it is likely to be far more cost-effective to wait until Intel releases the 80-core CPU, and the compilers to build for it, than to invest a lot of time and effort in modifying the code to run on GPUs - GPUs which, as others have pointed out, are not terrifically standardised. And if I did translate my million-line code for GPUs, what will I have to translate it to in 3 years' time?

The technology is impressive, I have no argument with that, but factor in the economics and GPUs look, from where I'm sitting, to be of doubtful value. Where I'm sitting, just to clarify, is computational electromagnetics on large clusters and supercomputers. And I have been bombarded by outfits offering GPU-based 'solutions' promising huge speedups at low, low cost. But whenever I look closer, I find 2 things:

  1. Today's GPUs are much less impressive at speeding up 64-bit f-p arithmetic than they are at speeding up 32-bit f-p arithmetic; and
  2. All the options require major rewrites, possibly even re-designs, of the codes.
High Performance Mark
Math co-processors were temporary, but all of their functionality simply transferred to CPUs. Which is exactly what Intel and AMD plan to do with GPUs.
Daniel
What about OpenCL? I haven't had a close look at it, but I thought it was designed for a longer lifetime than CUDA, and it supports more devices (also SSE, for example).
Nils
+1  A: 

I checked out CUDA, and at the time it appeared to me to be a terrible language to actually code in. When you target a CPU, the choice of language is almost entirely yours; languages that target the GPU are substantially harder to come by. If you don't like OpenCL or DirectCompute, you can forget writing code that most of your customers can use, or that your programmers can actually write.

DeadMG
Your argument could be paraphrased "Language X sucks because I don't know Language X."
John Dibling
@John: What's wrong with saying "Language X sucks because I don't know Language X"? I'd rather not learn a new language just to use some new hardware.
Andrew Grimm
@Andrew: That's not the language's problem.
John Dibling
@John: It is a problem, however. The fact remains that on a CPU there is a huge diversity of languages right now, and GPUs can't offer that.
DeadMG
+7  A: 

The great problem with GPGPU is that you have a lot of constraints when you're coding.

Take CUDA as an example:

  1. The problem needs to be really parallelizable.
  2. You can't send a lot of data back and forth between main RAM and GPU RAM (it's dead slow).
  3. You need to micromanage RAM, and you end up thinking hard about how to access it the right way to speed things up, because of caching and lots of other very low-level details. (By changing small things in a program I made, I was able to increase the speed 8-10 times just by optimizing memory accesses; see the sketch after this list.)
  4. It's hard to debug.
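
To illustrate point 3, this is the kind of difference I mean (an illustrative sketch, not the actual program I tuned). Both kernels read every element of a row-major n_rows x n_cols matrix once; the first lets consecutive threads read consecutive addresses, so the loads coalesce, while the second makes neighbouring threads stride n_cols floats apart:

    // Coalesced: thread "col" sums its column. At each loop step, threads
    // col, col+1, ... read adjacent addresses -> few wide memory transactions.
    __global__ void sum_columns_coalesced(const float *m, float *out,
                                          int n_rows, int n_cols)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (col >= n_cols) return;
        float s = 0.0f;
        for (int row = 0; row < n_rows; ++row)
            s += m[row * n_cols + col];     // stride 1 across the warp
        out[col] = s;
    }

    // Uncoalesced: thread "row" sums its row instead. At each loop step,
    // neighbouring threads are n_cols floats apart -> many narrow transactions.
    __global__ void sum_rows_strided(const float *m, float *out,
                                     int n_rows, int n_cols)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= n_rows) return;
        float s = 0.0f;
        for (int col = 0; col < n_cols; ++col)
            s += m[row * n_cols + col];     // stride n_cols across the warp
        out[row] = s;
    }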

Basically, the Mandelbrot thing is absolutely awesome on the GPU since: 1. it's embarrassingly easy to parallelize (every pixel can be calculated separately), and 2. not much data gets sent at all.

However, multi-threaded programming is the future (and present), so any programmer today should try to be familiar with the concepts.

Maister
Yes I vote for hard to debug too..
Nils
Is it hard to debug because it's multi-threaded programming, or are there other aspects that make it hard to debug?
Andrew Grimm
It's generally hard to inspect the data the GPU is processing. The flow is generally: 1. Transfer data. 2. Do magic. 3. Transfer data back.
Maister
+5  A: 

An additional answer to "if we can get such huge speedups by programming the GPU instead of the CPU, why is nobody doing it???" is that the huge speedup you achieved (roughly 47x, going from 2.23 fps to 104.7 fps) is out of line with what can normally be expected, even for programs that are well suited to GPU execution. The typical ratio between CPU FLOPS and GPU FLOPS simply isn't that high.

In your case, your 8800GTS is rated at 624 GFLOPS (single precision), while your CPU is rated at 38.4 GFLOPS, for a ratio of only about 16. So any speedup higher than this is probably not due to your GPU being faster - rather, it is because your CPU program is inefficient.

My guess is that stream processing via SSE isn't being used, which would give close to a 4-fold speed increase on the CPU - making the ratio very close to the ideal speedup.
Future CPUs will extend this even further - making the advantage of GPUs smaller even for those problems that they are ideally suited for.
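
To illustrate the SSE point: the escape-time loop can handle four pixels per iteration with single-precision SSE, roughly like this (a host-side sketch in plain C intrinsics, not the asker's actual code; cr/ci hold the four pixels' complex coordinates):

    #include <xmmintrin.h>

    // Iterate four Mandelbrot pixels at once; escape counts end up in out[0..3].
    void mandel4_sse(const float *cr, const float *ci, int max_iter, int *out)
    {
        __m128 zr = _mm_setzero_ps(), zi = _mm_setzero_ps();
        __m128 vcr = _mm_loadu_ps(cr), vci = _mm_loadu_ps(ci);
        __m128 four = _mm_set1_ps(4.0f), one = _mm_set1_ps(1.0f);
        __m128 counts = _mm_setzero_ps();

        for (int i = 0; i < max_iter; ++i) {
            __m128 zr2 = _mm_mul_ps(zr, zr);
            __m128 zi2 = _mm_mul_ps(zi, zi);
            __m128 alive = _mm_cmplt_ps(_mm_add_ps(zr2, zi2), four);
            if (_mm_movemask_ps(alive) == 0) break;        // all four lanes escaped
            counts = _mm_add_ps(counts, _mm_and_ps(alive, one));
            __m128 new_zi = _mm_add_ps(_mm_mul_ps(_mm_set1_ps(2.0f),
                                                  _mm_mul_ps(zr, zi)), vci);
            zr = _mm_add_ps(_mm_sub_ps(zr2, zi2), vcr);
            zi = new_zi;
        }

        float tmp[4];
        _mm_storeu_ps(tmp, counts);
        for (int k = 0; k < 4; ++k)
            out[k] = (int)tmp[k];
    }

It's the same arithmetic as the scalar loop, just four lanes wide, which is where the "close to 4 fold" figure comes from.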

For comparison, there's a detailed study of CPU vs. GPU performance on Mandelbrot calculations at the link below, with speed ratios from 0.4x to 15x under different conditions.

http://www.bealto.com/mp-mandelbrot_intro.html

RD1
+5  A: 

I just want to clearly state the fact that the vast majority of software engineers are not trained to program GPUs. They need to be retrained to take advantage of the hardware's massive parallelism and to understand its drawbacks (especially the memory model). This is not something that can be done easily, and I don't know if there is infrastructure in place to educate so many developers quickly.

Jose
Does "Not trained to" mean they went to a Java school? I went to a Java school too (which is unfortunate) and now I need to learn it by myself..
Nils
Sort of. I was talking more about serial vs. parallel programming. I would think that most of the developers writing code have done serial programming for the majority of their professional lives. Of course, there are exceptions, and this has to change as multicore hardware becomes more pervasive.
Jose
+1  A: 

It's not that easy. If you are used to OOP, then you have to rethink some things when coding for a GPU. CUDA supports only a subset of C as device code, so you can't use your classes; the highest-level construct you can use on a GPU so far is a struct. I ended up packing my stuff into 1D arrays most of the time, which makes the code less readable and more error-prone. Also, debugging on a GPU is not easy, as some people pointed out above.
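
By "packing into 1D arrays" I mean index arithmetic like this (a sketch; the field name and layout are just for illustration):

    // Instead of something like grid[y][x].value on the host, device code
    // typically ends up with a flat array and a hand-written 2d -> 1d mapping.
    __global__ void scale_field(float *field, int width, int height, float factor)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        int idx = y * width + x;            // the mapping you carry around everywhere
        field[idx] *= factor;
    }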

Most of the CUDA examples are quite small, and I think it won't be easy to write a larger software package with GPGPU support. I'm not sure what the status is, but what about programs like Photoshop or Final Cut? Also, raytracing seems to be quite fast. What about software like Blender, Cinema 4D, POV-Ray? Shipping a 3d package with a raytracer which supports CUDA (or OpenGL) is probably a feasible solution.

I personally think that many developers will use the GPU through a library in the future, for doing computationally intensive stuff such as linear algebra (CUBLAS), FFTs, and the like.

Nils
"You can't send a lot of data back and forth the main RAM and GPU ram. (It's dead slow)."Forgot to mention this too, you have to adjust your algorithm so it fits to the memory hierarchy of the GPU. Google for example for "cuda matrix multiplication shared memory"This makes programming even more brain time intensive and since brain time is usually the sparse resource..
Nils
+1  A: 

Check out this podcast: http://www.blogtalkradio.com/teachparallel/2009/06/09/teachparallel-dr-wen-mei-hwu-solving-data-intensive-problems

The main point is that it's hard to do and takes an enormous amount of time (and nerves ;). It's also worth mentioning that the debugger (cuda-gdb) was released only recently (for Linux; for Windows there is a beta of something for Visual Studio). I can also recommend the papers by Wen-mei Hwu - just Google them.

Nils
+1  A: 

I was also impressed going into this, but first you have the NVidia-cards-only issue, which is a big turnoff for most open-source applications, and similarly the uncertainty and immaturity of the field: you don't really know if the trend will continue (I don't mean the multi-core trend, which is incredible; I mean offloading computation to the graphics card).

The whole methodology frankly seems quite suspect in terms of overall engineering; it smacks of organic systems such as the brain, governed by random factors, rather than a well-planned HPC platform. My guess is that this will change in the very near future, with many-core chips (with fewer restrictions on context, etc.) incorporated onto the motherboard, much like math co-processors were incorporated in the past.

Intel and AMD will have the last word here, and not only in a bad sense, since they have better library-writing skills than NVidia.

OpenCL and DirectCompute (but mostly OpenCL) are also forces to be reckoned with. The whole paradigm balances on the assumption that your code will scale with time and allow larger Ns to be processed; if the hardware shifts around and forces you to rewrite in a different SDK every several years, that whole paradigm collapses.

Veltz
+1  A: 

Well, two main reasons that I can think of:

First, a lot of people still don't have the hardware required to get the kind of speedups seen on powerful GPU machines.

Second, it's still a pretty niche place in terms of programming. It's a (somewhat) new practice and it requires some extra work to get up and running. At least right now, an average programmer that has CUDA experience is hard to find, especially outside of academia.

This is largely due to the first reason, in that there isn't a huge CUDA market share. The reasons for CUDA's specific success or lack thereof are a different discussion; they would probably center on Nvidia's decision to keep CUDA very much proprietary. A problem with this is that the developers and users who have the interest and hardware resources to make use of it are very likely running on an open-source platform and place value in open-source ideals. Time will tell how this decision turns out for Nvidia. Either way, as it stands, there is some cool stuff to do with CUDA.

Sean O'Hollaren