Your CPU may be a quad-core, but did you know that some graphics cards today have over 200 cores? We've already seen what the GPUs in today's graphics cards can do for graphics. Now they can be used for non-graphical tasks as well, and in my opinion the results are nothing short of amazing. An algorithm that lends itself well to parallelism has the potential to be much, much faster on a GPU than it could ever be on a CPU.

There are a few technologies that make all of this possible:

1.) CUDA by NVidia. It seems to be the most well-known and well-documented. Unfortunately, it'll only work on NVidia video cards. I've downloaded the SDK, tried out some of the samples, and there's some awesome stuff that's being done in CUDA. But the fact that it's limited to NVidia cards makes me question its future.

2.) Stream by ATI. ATI's equivalent to CUDA. As you might expect, it will only work on ATI cards.

3.) OpenCL - The Khronos Group has put together this standard, but it's still in its infancy. I like the idea of OpenCL, though. The hope is that it will be supported by most video card manufacturers and should make cross-vendor development that much easier.

But what other technologies for non-graphical GPU programming are coming and what shows the most promise? And do you see or would you like to see these technologies being built into some of the mainstream development frameworks like .NET to make it that much easier?

A: 

I expect the same things that CPUs are used for?

I just mean this seems like a gimmick to me. I hesitate to say "that's going nowhere" when it comes to technology, but a GPU's primary function is graphics rendering and a CPU's primary function is all other processing. Having the GPU do anything else just seems whacky.

Spencer Ruport
Yeah, somewhat. The advantage of the GPU is its massively parallel capabilities. Today, each processor of a GPU is maybe half the speed of a single CPU core, but in some cards there are over 200 of them. This massive parallelism complicates the hell out of many programming tasks, but those who leverage it well will reap the benefits.
Steve Wortham
You should also note that (currently) there are severe limits on what the code may do. For example (if memory serves), recursion is forbidden, and the code for each core has to be the same - you cannot have different code for each core. Etcetera. All in all, the GPU is nowhere near a replacement for the CPU, but it does lend itself to certain kinds of tasks.
Vilx-
I just meant this seems more like a gimmick than anything. I don't expect anything terribly useful to come out of it.
Spencer Ruport
A gimmick? Are you kidding me? We're going to see increased separation like this as time goes on, because no one processor is going to be able to do everything the fastest. We have different engines for different automobiles for different purposes, so why wouldn't computer processing go the same way?
Sneakyness
@Sneakyness - "no one processor?" The trend is toward multiprocessors, and a lot of them. And they don't suffer the performance penalty of CPU <-> GPU <-> CPU memory movements, which can be quite costly. I/O matters a lot in performance-critical applications.
xcramps
Yes, the trend is multiple cores - like those "whacky" GPUs have. Take a gander at the PS3's "CPU". That's where the (likely?) future lies. And @Spencer, you really need to broaden your (intellectual) horizons.
jae
@jae, @sneakyness - Perhaps neither of you remembers a time before GPUs became standard. The separation between graphics processors and central processors came about for a reason. Trying to use a GPU as a CPU seems like trying to make our way back into the 90s to me.
Spencer Ruport
+3  A: 

Pretty much anything that can be parallelized may benefit. More specific examples would be SETI@home, Folding@home, and other distributed projects, as well as scientific computing.

Especially things that rely heavily on floating point arithmetic. This is because GPUs have specialized circuitry which is VERY fast at floating point operations. This means the GPU is not as versatile, but it's VERY good at what it does do.
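
As a rough illustration of the kind of code this means in practice, here is a minimal CUDA sketch (array size and constants are arbitrary) where every GPU thread applies the same floating-point operation to one array element, with no dependencies between elements:

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Each thread handles exactly one element: the "embarrassingly parallel"
// case that maps well onto hundreds of GPU cores.
__global__ void scale_and_offset(float *data, float a, float b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = a * data[i] + b;   // pure floating-point work per element
}

int main(void)
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_and_offset<<<blocks, threads>>>(d, 2.0f, 1.0f, n);
    cudaDeviceSynchronize();

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    printf("h[42] = %f\n", h[42]);   // expect 2*42 + 1 = 85

    cudaFree(d);
    free(h);
    return 0;
}
```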

If you want to look at more dedicated GPU processing, check out Nvidia's Tesla GPU. It's a GPU, but it doesn't actually have a monitor output!

I doubt we will see much GPU processing on the common desktop, at least for a while, because not everyone has a CUDA-capable (or similar) graphics card, if they have a graphics card at all. It's also very difficult to make programs more parallel. Games could possibly utilize this extra power, but it will be very difficult and probably won't be too useful, since graphics calculations are mostly on the GPU already and the other work is on the CPU and has to stay on the CPU because of the instruction sets.

GPU processing, at least for a while, will be for very specific niche markets that need a lot of floating point computation.

samoz
This is a stretch, and a generalization that gets made a lot here: the GPU is not just a different parallel processing model; it also has a different memory model. Random access is not great for GPUs, which hurts a lot of workloads like raytracing. Additionally, you have to be able to do significant processing on the GPU without too much back and forth with the CPU, which again reduces the applicability.
Chris
A: 

Your perception that GPUs are faster than CPUs is based on the misconception created by a few embarrassingly parallel applications applied to the likes of the PS3, NVIDIA, and ATI hardware.

http://en.wikipedia.org/wiki/Embarrassingly_parallel

Most real-world challenges are not easily decomposable into these types of tasks. The desktop CPU is far better suited to that kind of challenge from both a feature-set and a performance standpoint.

Nissan Fan
You're right. GPUs are not always faster. They're only going to be faster for algorithms that lend themselves well to parallelism.
Steve Wortham
I don't see how that is a misconception. GPUs *are* insanely fast, but of course they're less general than CPUs. It sounds like the OP is well aware of that. He's not asking whether *every* programming task is going to migrate to the GPU. Of course it isn't.
jalf
Your assertion that "most real world challenges are not decomposable easily into embarrassingly parallel tasks" is somewhat off-base. Much scientific computing and the vast majority of financial mathematics does decompose in this manner; for those tasks, such machines effectively act as supercomputers, far more powerful than general-purpose computers.
polyglot
+1  A: 

I have heard a great deal of talk about turning what today are GPUs into more general-purpose "array processor units", for use with any matrix math problem, rather than just graphics processing. I haven't seen much come of it yet, though.

The theory was that array processors might follow roughly the same trajectory that floating-point processors followed a couple of decades before. Originally, floating-point processors were expensive add-on options for PCs that not a lot of people bothered to buy. Eventually they became so vital that they were put into the CPU itself.

T.E.D.
+4  A: 

Monte Carlo is embarrassingly parallel, but it is a core technique in financial and scientific computing.

One of the respondents is slightly incorrect to say that most real-world challenges are not easily decomposable into these types of tasks.

Much tractable scientific investigation is done by leveraging what can be expressed in an embarrassingly parallel manner.

Just because it is named "embarrassingly" parallel does not mean it is not an extremely important field.

I've worked in several financial houses, and we foresee that we can throw out farms of 1000+ Monte Carlo engines (many stacks of blades lined up together) in favour of several large NVidia CUDA installations - massively decreasing power and heat costs in the data centre.
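
To give a feel for why Monte Carlo maps so naturally onto CUDA, here is a minimal sketch (a toy pi estimate rather than a real pricing model; the seed, thread count, and trial count are arbitrary) in which every thread runs its own independent simulation and the partial results are reduced on the host:

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Each thread is an independent Monte Carlo "engine": it draws random
// points in the unit square and counts hits inside the quarter circle.
// No communication between threads is needed until the final sum.
__global__ void mc_pi(unsigned long long seed, int trials, int *hits)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curandState state;
    curand_init(seed, id, 0, &state);

    int inside = 0;
    for (int t = 0; t < trials; ++t) {
        float x = curand_uniform(&state);
        float y = curand_uniform(&state);
        if (x * x + y * y <= 1.0f)
            ++inside;
    }
    hits[id] = inside;
}

int main(void)
{
    const int threads = 256, blocks = 256, trials = 1000;
    const int nthreads = threads * blocks;

    int *d_hits;
    cudaMalloc(&d_hits, nthreads * sizeof(int));
    mc_pi<<<blocks, threads>>>(1234ULL, trials, d_hits);

    int *h_hits = (int *)malloc(nthreads * sizeof(int));
    cudaMemcpy(h_hits, d_hits, nthreads * sizeof(int), cudaMemcpyDeviceToHost);

    long long total = 0;
    for (int i = 0; i < nthreads; ++i) total += h_hits[i];
    printf("pi ~= %f\n", 4.0 * total / ((double)nthreads * trials));

    cudaFree(d_hits);
    free(h_hits);
    return 0;
}
```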

One significant architectural benefit is that there is also a lot less network load, as there are far fewer machines that need to be fed data and report their results.

Fundamentally, however, such technologies sit at a lower level of abstraction than a managed runtime language such as C#; we are talking about hardware devices that run their own code on their own processors.

Integration should come first to Matlab and Mathematica, I'd expect, along with the C APIs of course...

polyglot
+11  A: 

I think you can count the next DirectX as another way to use the GPU.

From my experience, GPUs are extremely fast for algorithms that are easy to parallelize. I recently optimized a special image resizing algorithm in CUDA to be more than 100 times faster on the GPU (not even a high-end one) than on a quad-core Intel processor. The problem was getting the data to the GPU and then fetching the result back to main memory, both directions limited by the memcpy() speed on that machine, which was less than 2 GB/s. As a result, the algorithm was only slightly faster than the CPU version...
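
A quick way to see this bottleneck on your own machine is to time the transfers themselves. A minimal sketch using CUDA events (buffer size is arbitrary; error checking omitted):

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Times a round trip of data to the GPU and back. On many systems this
// transfer, not the kernel, dominates the total runtime.
int main(void)
{
    const size_t bytes = 256u * 1024 * 1024;   // 256 MB test buffer
    float *h = (float *)malloc(bytes);
    float *d;
    cudaMalloc(&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("round trip: %.1f ms (~%.2f GB/s effective)\n",
           ms, (2.0 * bytes / 1e9) / (ms / 1000.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    free(h);
    return 0;
}
```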

So it really depends. If you have a scientific application where you can keep most of the data on the GPU, and all algorithms map to a GPU implementation, then fine. Otherwise I would wait until there's a faster pipe between the CPU and GPU, or see what ATI has up its sleeve with a combined chip...

About which technology to use: I think once you have your stuff running in CUDA, the additional step of porting it to OpenCL (or another language) is not so large. You did all the heavy work by parallelizing your algorithms, and the rest is just a different 'flavor'.

chris166
Thanks Chris, it's good to see some first-hand experience on the topic. That's a good way to look at CUDA. I've been a little stand-offish about spending too much time developing in CUDA because I imagine some of the open standards will soon replace it. I'd be like a Glide developer -- haha. But perhaps I shouldn't worry about that too much. It'll be good experience no matter what.
Steve Wortham
Great answer, Chris. A lot of people are currently disappointed that GPU acceleration doesn't just magically make everything faster. Even if the problem is highly parallel, if the numeric intensity is low (i.e. the ratio of floating-point operations to memory accesses), then even if the GPU were infinitely fast, the data transfer cost would be greater than just doing the work on the CPU. Algorithms in that class include dot products, matrix-vector operations, FFTs, and anything with small matrices.
Die in Sente
+3  A: 

I foresee that this technology will become popular and mainstream, but it will take some time to do so. My guess is about 5 to 10 years.

As you correctly noted, one major obstacle to the adoption of the technology is the lack of a common library that runs on most adapters - both ATI and nVidia. Until this is solved to an acceptable degree, the technology will not enter the mainstream and will stay in the niche of custom-made applications that run on specific hardware.

As for integrating it with C# and other high-level managed languages - this will take a bit longer, but XNA already demonstrates that custom shaders and a managed environment can mix together, to a certain degree. Of course, the shader code is still not in C#, and there are several major obstacles to doing so.

One of the main reasons for the fast execution of GPU code is that it has severe limitations on what the code can and cannot do, and it uses VRAM instead of regular RAM. This makes it difficult to bring CPU code and GPU code together. While workarounds are possible, they would practically negate the performance gain.

One possible solution that I see is to make a sub-language of C# that has its own limitations, is compiled to GPU code, and has a strictly defined way of communicating with the usual C# code. However, this would not be much different from what we already have - just more comfortable to write because of some syntactic sugar and standard library functions. Still, this is ages away for now.

Vilx-
Thanks, Vilx. I agree. Unfortunately, I'm afraid it will be quite a few years' wait before this is truly mainstream, like you said. I hate to mention Glide again, but this reminds me of the days when 3D acceleration was just taking off. Glide was popular on 3dfx cards around the time OpenGL was just gaining steam. But Glide would ONLY work on 3dfx cards. So OpenGL (and later Direct3D) completely replaced Glide.
Steve Wortham
I expect the history to repeat itself here.
Vilx-
re: common library running on both ATI and nVidia: Apple plans to address this in their corner of the market (for Mac OS X users who upgrade or buy new machines) with Snow Leopard, shipping with OpenCL + drivers built in. Excited to see what developers dream up, since apps just need to require Snow Leopard; aside from that, the program should "just work" - no drivers to install.
Jared Updike
Nice move. Hopefully this will speed up things on the PC side as well. :)
Vilx-
re: C#+GPU - We came so close to this being a reality in this generation of .NET. Microsoft has added full expression tree support to .NET 4.0; unfortunately, they did not add accompanying support in any of the languages to be able to define entire method bodies as expressions. This is a shame, because if they had, I could very easily have added natural language support for CUDA in almost no time at all (assuming I had a PTX+compiler expert to help me out). Not just CUDA either - I could even choose to target OpenCL or DirectX Compute instead (or as well).
Drew Marsh
Wouldn't it also be possible to use something like Mono's Cecil to get the IL of the method in question and compile that to PTX or whatever? That shouldn't be "too" hard to do. However, whether that would make for natural language integration is still up for discussion, considering there would be a lot of restrictions on what could be done in the method (no exceptions, no memory allocations...), and the input/output would have to be transported to/from the device somehow (avoiding unnecessary transfers when possible for it to be of any use).
Grizzly
+2  A: 

It's important to keep in mind that even tasks that are inherently serial can benefit from parallelization if they must be performed many times independently.

Also, bear in mind that whenever anyone reports the speedup of a GPU implementation over a CPU implementation, it is almost never a fair comparison. To be truly fair, the implementers must first spend the time to create a truly optimized, parallel CPU implementation. A single Intel Core i7 965 XE CPU can achieve around 70 gigaflops in double precision today. Current high-end GPUs can do 70-80 gigaflops in double precision and around 1000 in single precision. Thus a speedup of more than 15 may imply an inefficient CPU implementation.

One important caveat with GPU computing is that it is currently "small scale". With a supercomputing facility, you can run a parallelized algorithm on hundreds or even thousands of CPU cores. In contrast, GPU "clusters" are currently limited to about 8 GPUs connected to one machine. Of course, several of these machines can be combined together, but this adds additional complexity as the data must not only pass between computers but also between GPUs. Also, there isn't yet an MPI equivalent that lets processes transparently scale to multiple GPUs across multiple machines; it must be manually implemented (possibly in combination with MPI).
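
As a sketch of what "manually implemented" looks like within a single machine (using the CUDA runtime API, where one host thread can drive several devices; the kernel here is just a placeholder), the programmer has to split the data, launch per device, and collect the results explicitly:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real per-slice computation.
__global__ void work(float *chunk, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        chunk[i] *= 2.0f;
}

int main(void)
{
    int ngpu = 0;
    cudaGetDeviceCount(&ngpu);
    if (ngpu > 16) ngpu = 16;   // keep the sketch's fixed-size array happy

    const int n_per_gpu = 1 << 20;
    float *dev_buf[16];

    // Each GPU gets its own slice; the split, the per-device launches,
    // and any exchange of results are all up to the programmer, since
    // there is no MPI-style layer doing it transparently.
    for (int g = 0; g < ngpu; ++g) {
        cudaSetDevice(g);
        cudaMalloc(&dev_buf[g], n_per_gpu * sizeof(float));
        work<<<(n_per_gpu + 255) / 256, 256>>>(dev_buf[g], n_per_gpu);
    }

    for (int g = 0; g < ngpu; ++g) {
        cudaSetDevice(g);
        cudaDeviceSynchronize();   // wait for that device's kernel
        cudaFree(dev_buf[g]);
    }

    printf("ran on %d GPU(s)\n", ngpu);
    return 0;
}
```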

Aside from this problem of scale, the other major limitation of GPUs for parallel computing is the severe restriction on memory access patterns. Random memory access is possible, but carefully planned memory access will result in many-fold better performance.
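
As a rough sketch of what "carefully planned" means in practice, here are two kernels (the host-side harness and the exact stride are left out) that do the same amount of work but access memory very differently; on most current hardware the strided version is many times slower:

```cuda
// Coalesced: consecutive threads read consecutive addresses, so the
// hardware can combine each warp's loads into a few wide transactions.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: consecutive threads touch addresses `stride` elements apart
// (an illustrative scattered pattern), so a warp's loads splinter into
// many separate memory transactions.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i * stride) % n];
}
```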

Perhaps the most promising upcoming contender is Intel's Larrabee. It has considerably better access to the CPU, system memory, and, perhaps most importantly, caching. This should give it considerable advantages with many algorithms. If it can't match the massive memory bandwidth of current GPUs, though, it may lag behind the competition for algorithms that make optimal use of this bandwidth.

The current generation of hardware and software requires a lot of developer effort to get optimal performance. This often includes restructuring algorithms to make efficient use of the GPU memory. It also often involves experimenting with different approaches to find the best one.

Note also that the effort required to get optimal performance is necessary to justify the use of GPU hardware in the first place. The difference between a naive implementation and an optimized implementation can be an order of magnitude or more. This means that an optimized CPU implementation will likely be as good as or even better than a naive GPU implementation.

People are already working on .NET bindings for CUDA. See here. However, with the necessity of working at a low level, I don't think GPU computing is ready for the masses yet.

Eric
A: 

GPUs work well on problems with a high level of data-level parallelism, which essentially means there is a way to partition the data to be processed such that each piece can be processed independently.

GPUs aren't inherently faster at the clock-speed level. In fact, I'm relatively sure the clock speed of the shaders (or maybe there's a more GPGPU term for them these days?) is quite slow compared to the ALUs on a modern desktop processor. The thing is, a GPU has an absolutely enormous number of these shaders, turning the GPU into a very large SIMD processor. With the number of shaders on a modern GeForce, for example, it's possible for a GPU to be working on several hundred (thousand?) floating point numbers at once.

In short, a GPU can be amazingly fast for problems where you can partition the data properly and process the partitions independently. It's not so powerful at task (thread)-level parallelism (http://en.wikipedia.org/wiki/Task_parallelism).

Falaina
+7  A: 

Another technology that's coming for GPU-based processing is GPU versions of existing high-level computational libraries. Not very flashy, I know, but it has significant advantages for portable code and ease of programming.

For example, AMD's Stream 2.0 SDK includes a version of their BLAS (linear algebra) library with some of the computations implemented on the GPU. The API is exactly the same as their CPU-only version of the library that they've shipped for years and years; all that's needed is relinking the application, and it uses the GPU and runs faster.

Similarly, Dan Campbell at GTRI has been working on a CUDA implementation of the VSIPL standard for signal processing. (In particular, the sort of signal and image processing that's common in radar systems and related things like medical imaging.) Again, that's a standard interface, and applications that have been written for VSIPL implementations on other processors can simply be recompiled with this one and use the GPU's capability where appropriate.

In practice, quite a lot of high-performance numerical programs these days do not do their own low-level programming, but rely on libraries. On Intel hardware, if you're doing number-crunching, it's generally hard to beat the Intel math libraries (MKL) for most of what they implement -- and using them means that you get the advantages of all of the vector instructions and clever tricks in newer x86 processors, without having to specialize your code for them. With things like GPUs, I suspect this will become even more prevalent.
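
As a small illustration of the pattern (using NVIDIA's cuBLAS here purely as an example of a vendor BLAS running on the GPU; its handle-based interface is not a drop-in match for the CPU BLAS API, and you link with -lcublas), the application just calls a library routine and never writes a kernel itself:

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// y = alpha*x + y, computed on the GPU entirely through library calls.
int main(void)
{
    const int n = 1 << 20;
    float alpha = 2.0f;

    float *h_x = (float *)malloc(n * sizeof(float));
    float *h_y = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 3.0f; }

    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);   // the actual BLAS call
    cublasDestroy(handle);

    cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", h_y[0]);   // expect 2*1 + 3 = 5

    cudaFree(d_x); cudaFree(d_y);
    free(h_x); free(h_y);
    return 0;
}
```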

So I think a technology to watch is the development of general-purpose libraries that form core building blocks for applications in specific domains, in ways that capture parts of those algorithms that can be efficiently sent off to the GPU while minimizing the amount of nonportable GPU-specific cleverness required from the programmer.

(Bias disclaimer: My company has also been working on a CUDA port of our VSIPL++ library, so I'm inclined to think this is a good idea!)

Also, in an entirely different direction, you might want to check out some of the things that RapidMind is doing. Their platform was initially intended for multicore CPU-type systems, but they've been doing a good bit of work extending it to GPU computations as well.

Brooks Moses
+1  A: 

I'll repeat the answer I gave here.

Long-term I think that the GPU will cease to exist, as general purpose processors evolve to take over those functions. Intel's Larrabee is the first step. History has shown that betting against x86 is a bad idea.

Mark Ransom
I agree, but I don't think it's necessarily the advancement of general-purpose CPUs that will take over the GPU's responsibilities, but rather multi-core CPUs with specialized cores.
Neil N
Too bad Larrabee got cancelled
Michael Mullany
@Michael: yes, that outcome surprised me.
Mark Ransom
A: 

A big problem with GPU technology is that while you do have a lot of compute capability in there, getting data into it (and out of it) is terrible, performance-wise. And watch carefully for any comparison benchmarks... they often compare gcc (with minimal optimization and no vectorization) on a single-processor system to the GPU.

Another big problem with GPUs is that if you don't CAREFULLY think about how your data is organized, you will suffer a real performance hit internally (in the GPU). Avoiding that often involves rewriting very simple code into a convoluted pile of rubbish.

xcramps
A: 

I'm very excited about this technology. However, I think it will only exacerbate the real challenge of large parallel tasks: bandwidth. Adding more cores will only increase contention for memory, and OpenCL and other GPGPU abstraction libraries don't offer any tools to improve that.

Any high-performance computing hardware platform will usually be designed with the bandwidth issue carefully planned into the hardware, balancing throughput, latency, caching, and cost. As long as commodity CPUs and GPUs are designed in isolation from each other, with bandwidth optimized only to their own local memory, it will be very difficult to improve this for the algorithms that need it.

TokenMacGuy
+1  A: 

GHC (Haskell) researchers (working for Microsoft Research) are adding support for Nested Data Parallelism directly to a general purpose programming language. The idea is to use multiple cores and/or GPUs on the back end yet expose data parallel arrays as a native type in the language, regardless of the runtime executing the code in parallel (or serial for the single-CPU fallback).

http://www.haskell.org/haskellwiki/GHC/Data_Parallel_Haskell

Depending on the success of this in the next few years, I would expect to see other languages (C# specifically) pick up on the idea, which could bring these sorts of capabilities to a more mainstream audience. Perhaps by that time the CPU-GPU bandwidth and driver issues will be resolved.

Jared Updike
+1  A: 

At AccelerEyes, we've been building Jacket, which offers GPU acceleration for MATLAB-based codes. We are currently using CUDA under the hood, for a variety of reasons, but we'll move to OpenCL when that technology matures at AMD and Intel. We posted a more in-depth discussion of GPU languages here: AccelerEyes Blog

Other cool projects in the works include CULAtools (for LAPACK functions on the GPU), GPU VSIPL, and libraries coming from Acceleware.

melonakos
