Hi!

For most of my life, I've programmed CPUs. Although for most algorithms the big-O running time is the same on CPUs and FPGAs, the constants are quite different (for example, lots of CPU power is wasted shuffling data around, whereas FPGAs are often compute bound).

I would like to learn more about this -- does anyone know of good books / reference papers / tutorials that deal with these questions:

  • What tasks do FPGAs dominate CPUs on, in terms of pure speed?
  • What tasks do FPGAs dominate CPUs on, in terms of work per joule?

Thanks!

Note: marked community wiki

+1  A: 

For pure speed:

  • Parallelizable tasks
  • DSP, e.g. video filters
  • Moving data, e.g. DMA

Brian Carlton
+9  A: 

[no links, just my musings]

FPGAs are essentially interpreters for hardware! The architecture is like dedicated ASICs, but in exchange for rapid development you pay a factor of ~10 in frequency and a [don't know, at least 10?] factor in power efficiency.

So take any task where dedicated HW can massively outperform CPUs, divide by the FPGA 10/[?] factors, and you'll probably still have a winner. Typical qualities of such tasks:

  • Massive opportunities for fine-grained parallelism.
    (Doing 4 operations at once doesn't count; 128 does.)
  • Opportunity for deep pipelining.
    This is also a kind of parallelism, but it's hard to apply it to a single task, so it helps if you can get many separate tasks to work on in parallel.
  • (Mostly) Fixed data flow paths.
    Some muxes are OK, but massive random accesses are bad, because you can't parallelize them. But see below about memories.
  • High total bandwidth to many small memories.
    FPGAs have hundreds of small (O(1KB)) internal memories (BlockRAMs in Xilinx parlance), so if you can partition your memory usage into many independent buffers, you can enjoy a data bandwidth that CPUs never dreamed of.
  • Small external bandwidth (compared to internal work). The ideal FPGA task has small inputs and outputs but requires a lot of internal work. This way your FPGA won't starve waiting for I/O. (CPUs already suffer from starving, and they alleviate it with very sophisticated (and big) caches, unmatchable in FPGAs.) It's perfectly possible to connect a huge I/O bandwidth to an FPGA (~1000 pins nowadays, some with high-rate SERDESes) - but doing that requires a custom board architected for such bandwidth; in most scenarios, your external I/O will be a bottleneck.
  • Simple enough for HW (aka good SW/HW partitioning).
    Many tasks consist of 90% irregular glue logic and only 10% hard work ("kernel" in the DSP sense). If you put all that onto an FPGA, you'll waste precious area on logic that does no work most of the time. Ideally, you want all the muck to be handled in SW and fully utilize the HW for the kernel. ("Soft-core" CPUs inside FPGAs are a popular way to pack lots of slow irregular logic onto medium area, if you can't offload it to a real CPU.)
  • Weird bit manipulations are a plus.
    Things that don't map well onto traditional CPU instruction sets, such as unaligned access to packed bits, hash functions, coding & compression... However, don't overestimate the factor this gives you - most data formats and algorithms you'll meet have already been designed to go easy on CPU instruction sets, and CPUs keep adding specialized instructions for multimedia.
    Lots of Floating point specifically is a minus because both CPUs and GPUs crunch them on extremely optimized dedicated silicon. (So-called "DSP" FPGAs also have lots of dedicated mul/add units, but AFAIK these only do integers?)
  • Low latency / real-time requirements are a plus.
    Hardware can really shine under such demands.
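To make the "weird bit manipulations" point concrete, here's a rough Python sketch (my example, not from the answer) of unaligned packed-bit access. On a CPU each field costs a wide load plus shifts and masks; in an FPGA, a fixed field position makes the whole extraction collapse into selecting the right wires:

```python
def extract_bits(data: bytes, bit_offset: int, width: int) -> int:
    """Read an unaligned bitfield of `width` bits starting at `bit_offset`
    (MSB-first) from a packed byte string -- several shifts and masks on a
    CPU, but pure routing in an FPGA when the offset is fixed."""
    value = int.from_bytes(data, "big")      # whole buffer as one integer
    total_bits = len(data) * 8
    shift = total_bits - bit_offset - width  # drop the bits to the right
    return (value >> shift) & ((1 << width) - 1)

# A 5-bit field starting 3 bits into 0b10110110_11010010
print(extract_bits(bytes([0b10110110, 0b11010010]), 3, 5))   # → 22 (0b10110)
```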
Beni Cherniavsky-Paskin
I like. Upvoted.
anon
+3  A: 

Well, the newest generation of Xilinx parts just announced boasts 4.7 TMACS and general-purpose logic at 600MHz. (These are basically Virtex 6s fabbed on a smaller process.)

On a beast like this, if you can implement your algorithms with fixed-point operations -- primarily multiplies, adds and subtracts -- and take advantage of both wide parallelism and pipelined parallelism, you can eat most PCs alive, in terms of both power and processing.

You can do floating point on these, but there will be a performance hit. The DSP blocks contain a 25x18-bit MACC with a 48-bit sum. If you can get away with oddball formats and bypass some of the floating-point normalization that normally occurs, you can still eke out a truckload of performance. (E.g. use the 18-bit input as straight fixed point, or as a float with a 17-bit mantissa instead of the normal 24 bits.) Double-precision floats are going to eat a lot of resources, so if you need those, you'll probably do better on a PC.

If your algorithms can be expressed in terms of add and subtract operations, the general-purpose logic in these can be used to implement a gazillion adders. Things like Bresenham's line/circle/yadda/yadda/yadda algorithms are VERY good fits for FPGA designs.
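For reference, Bresenham's line algorithm really is nothing but adds, subtracts and comparisons. Here's a plain Python version of the standard integer-error formulation (a generic textbook version, not anyone's FPGA implementation):

```python
def bresenham(x0, y0, x1, y1):
    """Bresenham's line algorithm: integer adds, subtracts and compares
    only -- exactly the operations FPGA fabric is full of."""
    points = []
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy                      # running error term
    while True:
        points.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:                   # step in x
            err += dy
            x0 += sx
        if e2 <= dx:                   # step in y
            err += dx
            y0 += sy
    return points

print(bresenham(0, 0, 4, 2))   # → [(0, 0), (1, 1), (2, 1), (3, 2), (4, 2)]
```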

If you need division... eh... it's painful, and probably going to be relatively slow unless you can implement your divides as multiplies.
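One standard way to turn divides into multiplies (a sketch, assuming an unsigned divisor that is constant and known ahead of time) is the reciprocal "magic number" trick: precompute floor(2^N / d) + 1 once, and every subsequent division is a single multiply plus a shift:

```python
def magic(d: int, bits: int = 32):
    """Precompute a reciprocal 'magic number' for unsigned division by a
    constant d. With N = 2*bits and m = floor(2**N / d) + 1,
    (x * m) >> N equals x // d for every x < 2**bits, because the error
    introduced by the +1 is strictly less than one part in 2**bits."""
    N = 2 * bits
    return (1 << N) // d + 1, N

def div_by_const(x: int, m: int, shift: int) -> int:
    # One multiply and one shift -- no divider circuit needed.
    return (x * m) >> shift

m, N = magic(10)
print(div_by_const(1234567, m, N))   # → 123456
```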

If you need lots of high-precision trig functions, not so much... Again, it CAN be done, but it's not going to be pretty or fast. (Just like it can be done on a 6502.) If you can cope with just using a lookup table over a limited range, then you're golden!
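The lookup-table approach can be sketched like this (my example, not the answerer's): store only a quarter wave of sine in a table small enough for one BlockRAM, and fold the other three quadrants in by symmetry:

```python
import math

# Quarter-wave sine LUT: store only 0..90 degrees and reconstruct the rest
# by symmetry -- a classic way to fit trig into a single BlockRAM.
LUT_SIZE = 256
QUARTER = [math.sin(math.pi / 2 * i / LUT_SIZE) for i in range(LUT_SIZE + 1)]

def lut_sin(phase: int) -> float:
    """sin(2*pi * phase / (4*LUT_SIZE)) via table lookup plus
    sign/index folding -- no multiplier, no CORDIC."""
    quadrant, idx = divmod(phase % (4 * LUT_SIZE), LUT_SIZE)
    if quadrant == 0:
        return QUARTER[idx]
    if quadrant == 1:
        return QUARTER[LUT_SIZE - idx]
    if quadrant == 2:
        return -QUARTER[idx]
    return -QUARTER[LUT_SIZE - idx]

print(lut_sin(256))   # phase 256 of 1024 = 90 degrees → 1.0
```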

Speaking of the 6502, a 6502 demo coder could make one of these things sing. Anybody familiar with all the old math tricks programmers used on old-school machines like that will find they still apply. All the tricks modern programmers are told to "let the library do for you" are the kinds of things you need to know to implement math on these. If you can find a book that talks about doing 3D on a 68000-based Atari or Amiga, it will discuss a lot of how to implement things in integer only.

ACTUALLY, any algorithm that can be implemented using lookup tables will be VERY well suited for FPGAs. Not only do you have BlockRAMs distributed throughout the part, but the logic cells themselves can be configured as various-sized LUTs and mini RAMs.

You can view fixed bit manipulations as FREE! They're simply handled by routing. Fixed shifts and bit reversals cost nothing. Dynamic bit operations, like shifting by a variable amount, cost a minimal amount of logic and can be done till the cows come home!
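A small Python illustration of why fixed bit manipulations are free in hardware: reversing the bits of a word takes a loop (or a table) in software, but in an FPGA with a fixed width it is literally just crossed wires, zero logic:

```python
def bit_reverse(x: int, width: int) -> int:
    """Reverse the bits of a fixed-width word. In software this is a loop
    (or a lookup table); in an FPGA it's pure routing -- wire i of the
    input connects to wire (width-1-i) of the output."""
    result = 0
    for i in range(width):
        if x & (1 << i):
            result |= 1 << (width - 1 - i)
    return result

print(bin(bit_reverse(0b1101, 8)))   # 0b00001101 → 0b10110000
```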

The biggest part has 3,960 multipliers! And 142,200 slices, EACH of which can be an 8-bit adder. (Four 6-bit LUTs per slice, or eight 5-bit LUTs per slice, depending on configuration.)

NoMoreZealots
A: 

Pick a gnarly SW algorithm. Our company does HW acceleration of SW algorithms for a living.

We've done HW implementations of regular expression engines that will run thousands of rule sets in parallel at speeds up to 10Gb/sec. The target market for that is routers, where anti-virus and IPS/IDS can run in real time as the data streams by without slowing down the router.

We've done HD video encoding in HW. It used to take several hours of processing time per second of film to convert it to HD. Now we can do it almost in real time -- it takes about 2 seconds of processing to convert 1 second of film. Netflix used our HW almost exclusively for their video-on-demand product.

We've even done simple stuff like RSA, 3DES, and AES encryption and decryption in HW. We've done simple zip/unzip in HW. The target market for that is security video cameras. The government has a massive number of video cameras generating huge streams of real-time data. They zip it down in real time before sending it over their network, and then unzip it in real time on the other end.

Heck, another company I worked for used to do radar receivers using FPGAs. They would sample the digitized enemy radar data directly from several different antennas, and from the time delta of arrival figure out what direction the enemy transmitter was in and how far away it was. Heck, we could even check the unintended modulation on pulse of the signals in the FPGAs to work out the fingerprint of specific transmitters, so we could know that this signal was coming from a specific Russian SAM site that used to be stationed at a different border, and thereby track weapons movements and sales.

Try doing that in software!! :-)

SDGator