views: 269
answers: 4
We are planning to port a large part of our Digital Signal Processing routines from hardware-specific chips to a common desktop CPU architecture, such as a quad-core x86. I am trying to estimate the limits of such an architecture for a program built with GCC. I am mostly interested in the achievable SDRAM-CPU bandwidth [GB/sec] and in the highest number of 32-bit IEEE-754 floating-point multiply-accumulate operations per second.

I have selected a typical representative of modern desktop CPUs:
quad-core, about 10 MB cache, 3 GHz, 45 nm.
Can you please help me find its limits?

1) The highest possible number of multiply-accumulate operations per second, assuming all cores are used and GCC is allowed to use CPU-specific instructions via its target flags. The source code itself must not require changes if we later port it to a different CPU architecture, such as AltiVec on PowerPC, so the best option is to rely on GCC flags like -msse or -maltivec. I also assume the program needs 4 threads in order to utilize all available cores, right?

2) SDRAM-CPU bandwidth (the theoretical upper limit, i.e. independent of the mainboard).

UPDATE: Since GCC 3, GCC can automatically generate scalar SSE/SSE2 code when the target supports those instructions. Automatic vectorization for SSE/SSE2 was added in GCC 4. SSE4.1 introduces the DPPS and DPPD instructions (dot product for array-of-structs data). New 45 nm Intel processors support SSE4 instructions.

+2  A: 

This may not directly answer your question, but have you considered using the PC's graphics cards for parallel floating-point computations? It's getting to the point where GPUs will outperform CPUs for some tasks; and the nice thing is that graphics cards are reasonably competitively priced.

I'm short on details, sorry; this is just to give you an idea.

Carl Smotricz
Actually, that point is long past. :)
Michael Foukarakis
We know it, but we don't need it.
psihodelia
+2  A: 

Hi

I'm surprised you ask SO this question. Before you go much further, I'd suggest you're going to have to start testing right now. Surely the porting project you refer to can afford a test bed? That aside, I think you're on the right track: the speed at which the CPU can get data from RAM is most likely going to be the limiting factor in your quest for speed.

Regards

Mark

High Performance Mark
We cannot test it because it is a completely different architecture.
psihodelia
Why don't you get a candidate machine and write some small benchmark programs for testing performance?
starblue
I don't understand OP's assertion that the porting is untestable. I suspect that OP means that the current implementation is part-hardware and would require coding into C (or whatever) before becoming testable on a new platform. I stick with my original belief that no amount of arguing from first principles will reveal maximum speeds, OP will have to port and test.
High Performance Mark
+3  A: 

First off, know that it will most likely not be possible for your code to both run as fast as possible on modern vector FPU units and be completely portable across architectures. It is possible to abstract away some aspects of the architectures via macros, etc, but compilers are (at present) capable of generating nearly optimal auto-vectorized code only for very simple programs.

Now, on to your questions: current x86 hardware does not have a multiply-accumulate instruction, but it is capable of one vector add and one vector multiply per cycle per core. Assuming that your code achieves full computational density, and you either hand-write vector code or your code is simple enough for the compiler to handle the task, the peak throughput that can be achieved, independent of memory access latency, is:

number of cores * cycles per second * flops per cycle * vector width

Which in your case sounds like:

4 * 3.2 GHz * 2 vector flops/cycle * 4 floats/vector = 102.4 Gflops

If you are going to write scalar code, divide that by four. If you are going to write vector code in C with some level of portable abstraction, plan to be leaving some performance on the table, but you can certainly go substantially faster than scalar code will allow. 50% of theoretical peak is a conservative guess (I would expect to do better assuming the algorithms are amenable to vectorization, but make sure you have some headroom in your estimates).

edit: notes on DPPS:

DPPS is not a multiply-add, and using it as one is a performance hazard on current architectures. Looking it up in the Intel Optimization Manual, you will find that it has a latency of 11 cycles, and throughput is only one vector result every two cycles. DPPS does up to four multiplies and three adds, so you're getting 2 multiplies per cycle and 1.5 adds, whereas using MULPS and ADDPS would get you 4 of each every cycle.

More generally, horizontal vector operations should be avoided unless absolutely necessary; lay out your data so that your operations stay within vector lanes to the maximum extent possible.

In fairness to Intel, if you can't change your data layout, and DPPS happens to be exactly the operation that you need, then you want to use it. Just be aware that you're limiting yourself to less than 50% of peak FP throughput right off the bat by doing so.

Stephen Canon
Can you please point me where can I find information about latency of different operations for Intel CPUs?
psihodelia
Intel publishes the information in the "Intel Optimization Manual", which is a free download from http://www.intel.com/products/processor/manuals/
Stephen Canon
A: 

Some points you should consider:

1) Intel's i7 architecture is at the moment your fastest option for 1 or 2 CPUs. Only with 4 or more sockets can AMD's Opterons compete.

2) Intel's compilers generate code that is often significantly faster than code generated by other compilers (when used on AMD CPUs you have to patch away some CPU checks Intel puts in to prevent AMD from looking good).

3) No x86 CPU supports multiply-and-add yet; AMD's next architecture, "Bulldozer", will probably be the first to support it.

4) You get high memory bandwidth on any AMD CPU, but on Intel only with the new i7 architecture (socket 1366 is better than 775).

5) Use Intel's highly efficient libraries if possible.

Whoever