+9  Q: 

CUDA vs FPGA?

I am developing a product with heavy 3D graphics computations, largely closest-point and range searches. Some hardware optimization would be useful. While I know little about this, my boss (who has no software experience) advocates FPGA (because it can be tailored), while our junior developer advocates GPGPU with CUDA, because it's cheap, hot and open. While I feel I lack the judgement to decide, I believe CUDA is the way to go, also because I am worried about flexibility: our product is still under heavy development.

So, rephrasing the question, are there any reasons to go for FPGA at all? Or is there a third option?

+13  A: 

I investigated the same question a while back. After chatting to people who have worked on FPGAs, this is what I gathered:

  • FPGAs are great for realtime systems, where even 1ms of delay might be too long. This does not apply in your case;
  • FPGAs can be very fast, especially for well-defined digital signal processing uses (e.g. radar data), but the good ones are much more expensive and specialised than even professional GPGPUs;
  • FPGAs are quite cumbersome to programme. Since there is a hardware configuration component to compiling, it could take hours. It seems to be more suited to electronic engineers (who are generally the ones who work on FPGAs) than software developers.

If you can make CUDA work for you, it's probably the best option at the moment. It will certainly be more flexible than an FPGA.

Other options include Brook from ATI, but until something big happens, it is simply not as well adopted as CUDA. After that, there are still all the traditional HPC options (clusters of x86/PowerPC/Cell), but they are all quite expensive.

Hope that helps.

biozinc
"CUDA will be certainly more flexible than a FPGA" is false. For CUDA, you have to twist and turn your algorithm in very specific ways to enjoy the speed-up. With FPGAs you can do whatever you want, i.e. implement specialized computation routines tailored just for your algorithm. Granted, this requires HDL programming knowledge, so CUDA is indeed more accessible for software programmers.
Eli Bendersky
+2  A: 

CUDA has a fairly substantial code base of examples and an SDK, including a BLAS back-end. Try to find some examples similar to what you are doing, perhaps also looking at the GPU Gems series of books, to gauge how well CUDA will fit your applications. I'd say from a logistical point of view, CUDA is easier to work with and much, much cheaper than any professional FPGA development toolkit.

At one point I did look into CUDA for claim reserve simulation modelling. There is quite a good series of lectures linked off the web-site for learning. On Windows, you need to make sure CUDA is running on a card with no displays as the graphics subsystem has a watchdog timer that will nuke any process running for more than 5 seconds. This does not occur on Linux.

Any machine with two PCI-e x16 slots should support this. I used an HP XW9300, which you can pick up off eBay quite cheaply. If you do, make sure it has two CPUs (not one dual-core CPU), as the PCI-e slots live on separate HyperTransport buses and you need two CPUs in the machine to have both buses active.

ConcernedOfTunbridgeWells
+6  A: 

I would go with CUDA.
I work in image processing and have been trying hardware add-ons for years. First we had the i860, then the Transputer, then DSPs, then FPGAs and direct compilation to hardware.
What inevitably happened was that by the time the hardware boards were really debugged and reliable and the code had been ported to them, regular CPUs had advanced to beat them, or the hosting machine architecture had changed and we couldn't use the old boards, or the makers of the board had gone bust.

By sticking to something like CUDA you aren't tied to one small specialist maker of FPGA boards. The performance of GPUs is improving faster than that of CPUs and is funded by the gamers. It's a mainstream technology, so it will probably merge with multi-core CPUs in the future and so protect your investment.

Martin Beckett
+11  A: 

We did some comparisons between FPGA and CUDA. CUDA shines if you can really formulate your problem in a SIMD fashion AND can access memory in a coalesced way. If the memory accesses are not coalesced(1), or if different threads follow different control flow, the GPU can lose much of its performance and the FPGA can outperform it. Another case is when each operation is relatively small but you have a huge number of them, and you cannot (e.g. due to synchronisation) run them in a loop inside one kernel; then the invocation time of the GPU kernel exceeds the computation time.
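To put rough numbers on the kernel-invocation point, here is a minimal back-of-the-envelope sketch in Python. The figures (10 µs per launch, 1 µs of work per operation) are assumptions picked for illustration, not measurements, and the helper `total_time_us` is hypothetical. It also ignores parallelism inside a launch, so it only shows how the fixed launch cost scales.

```python
def total_time_us(num_ops, compute_us_per_op, launch_us, ops_per_launch):
    """Total time when num_ops small operations are grouped into kernel launches.

    Deliberately simple model: every launch pays a fixed overhead,
    and the compute cost is the same however the work is grouped.
    """
    launches = -(-num_ops // ops_per_launch)  # ceiling division
    return launches * launch_us + num_ops * compute_us_per_op


# Assumed, illustrative figures (not measured on any real GPU).
LAUNCH_US = 10.0   # fixed cost of one kernel invocation
COMPUTE_US = 1.0   # useful work per operation
N = 1000           # number of small operations

# Synchronisation forces one launch per operation.
one_per_launch = total_time_us(N, COMPUTE_US, LAUNCH_US, ops_per_launch=1)

# All operations can be looped inside a single kernel.
batched = total_time_us(N, COMPUTE_US, LAUNCH_US, ops_per_launch=N)

print(one_per_launch)  # 11000.0
print(batched)         # 1010.0
```

Under these assumed numbers, launching once per operation spends about ten times longer on invocation than on computation, which is the situation described above.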

Also the power efficiency of the FPGA could be better (depending on your application scenario, i.e. the GPU is only cheaper (in terms of Watts/Flop) when it is computing all the time).

Of course the FPGA also has some drawbacks. IO can be one (we had an application here where we needed 70 GB/s: no problem for a GPU, but to get this amount of data into an FPGA with a conventional design you need more pins than are available). Another drawback is time and money: an FPGA is much more expensive than the best GPU, and development times are very high.

(1) Simultaneous accesses from different threads to memory have to be to sequential addresses. This is sometimes really hard to achieve.
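A rough way to see why footnote (1) matters is to count how many memory transactions a group of threads needs for different access patterns. The sketch below is a simplified Python model, not real GPU behaviour: the warp size of 32 matches NVIDIA hardware, but the 128-byte segment size and the function `transactions_per_warp` are assumptions for illustration, and the exact coalescing rules differ between GPU generations.

```python
WARP_SIZE = 32        # threads that issue a memory request together
SEGMENT_BYTES = 128   # assumed size of one memory transaction
ELEM_BYTES = 4        # one 32-bit float per thread


def transactions_per_warp(stride_elems):
    """Count distinct memory segments touched when thread t reads element t * stride."""
    segments = {
        (t * stride_elems * ELEM_BYTES) // SEGMENT_BYTES
        for t in range(WARP_SIZE)
    }
    return len(segments)


print(transactions_per_warp(1))   # 1  (sequential addresses: fully coalesced)
print(transactions_per_warp(2))   # 2
print(transactions_per_warp(32))  # 32 (every thread hits its own segment)
```

In this model, a stride of 32 elements fetches 32 segments to deliver the same 128 bytes of useful data, so the effective bandwidth drops by that factor.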

flolo
Nice answer. While the other answers confirmed what we had already researched, you provided some concrete examples of when either one may be better. Thanks.
Fredriku73
Is there something wrong with the 70GB/s value? The newest Fermi (2010) has 16 PCIe v2.0 lanes, which gives 8GB/s. The on-card memory (GDDR5) can reach up to 54.4GB/s. This is fast, but only a few GB are available.
name
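For reference, the 8GB/s figure in the comment above follows from the PCIe 2.0 link parameters: 5 GT/s per lane with 8b/10b encoding, per direction. A short Python check, where the function name `pcie_bandwidth_gbytes` is made up for this example:

```python
def pcie_bandwidth_gbytes(lanes, gigatransfers_per_s, payload_bits, encoded_bits):
    """Usable one-direction bandwidth of a PCIe link in GB/s."""
    # Line encoding overhead: only payload_bits of every encoded_bits carry data.
    usable_gbits_per_lane = gigatransfers_per_s * payload_bits / encoded_bits
    return lanes * usable_gbits_per_lane / 8  # bits -> bytes


# PCIe 2.0 x16: 5 GT/s per lane, 8b/10b encoding.
print(pcie_bandwidth_gbytes(16, 5.0, 8, 10))  # 8.0
```

This is the theoretical link rate; protocol overhead reduces what an application actually sees.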
+2  A: 

An FPGA-based solution is likely to be way more expensive than CUDA.

OutputLogic
A: 

Obviously this is a complex question. The question might also include the Cell processor. And there is probably not a single answer which is correct for other related questions.

In my experience, any implementation done in an abstract fashion, i.e. compiled high-level language vs. machine-level implementation, will inevitably have a performance cost, especially in a complex algorithm implementation. This is true of both FPGAs and processors of any type. An FPGA designed specifically to implement a complex algorithm will perform better than an FPGA whose processing elements are generic, allowing it a degree of programmability from input control registers, data I/O etc.

Another general example where an FPGA can deliver much higher performance is in cascaded processes, where one process's outputs become the inputs to another and they cannot be done concurrently. Cascading processes in an FPGA is simple and can dramatically lower memory I/O requirements, whereas on a processor, memory has to be used to cascade two or more processes where there are data dependencies.

The same can be said of a GPU and CPU. Algorithms implemented in C and executing on a CPU, developed without regard to the inherent performance characteristics of the cache memory or main memory system, will not perform as well as ones that take those characteristics into account. Granted, not considering these performance characteristics simplifies implementation, but at a performance cost.

I have no direct experience with GPUs, but knowing their inherent memory system performance issues, I expect they too will be subject to the same effect.

+1  A: 

What are you deploying on? Who is your customer? Without even knowing the answers to these questions, I would not use an FPGA unless you are building a real-time system and have electrical/computer engineers on your team who know hardware description languages such as VHDL and Verilog. There's a lot to it, and it takes a different frame of mind than conventional programming.

temp2290
+1  A: 

I'm a CUDA developer with very little experience with FPGAs; however, I've been trying to find comparisons between the two.

What I've concluded so far:

  • The GPU has by far higher (accessible) peak performance;
  • It has a more favorable FLOP/watt ratio;
  • It is cheaper;
  • It is developing faster (quite soon you will literally have a "real" TFLOP available);
  • It is easier to program (read an article on this, not personal opinion).

Note that I'm saying real/accessible to distinguish from the numbers you will see in a GPGPU commercial.

BUT the GPU is not more favorable when you need to do random accesses to data. This will hopefully change with the new Nvidia Fermi architecture, which has an optional L1/L2 cache.

my 2 cents

jim