ansaurus

Question

C - what are the speed limits of desktop CPUs if a program is built using GCC with all optimization flags?

Answer 1

+2 A:

This may not directly answer your question, but have you considered using the PC's graphics cards for parallel floating-point computations? It's getting to the point where GPUs will outperform CPUs for some tasks; and the nice thing is that graphics cards are reasonably competitively priced.

I'm short on details, sorry; this is just to give you an idea.

Carl Smotricz 2009-11-25 12:57:28

Actually, that point is long past. :)

Michael Foukarakis 2009-11-25 12:59:24

We know it, but we don't need it.

psihodelia 2009-11-25 13:12:57

Answer 2

+2 A:

Hi

I'm surprised you're asking SO this question. Before you go much further, I'd suggest that you start testing right now. Surely the porting project you refer to can afford a test bed? Leaving that aside, I think you're on the right track: the speed at which the CPU can get data from RAM is most likely going to be the limiting factor in your quest for speed.

Regards

Mark

High Performance Mark 2009-11-25 13:03:45

We cannot test it because it is a completely different architecture.

psihodelia 2009-11-25 13:11:48

Why don't you get a candidate machine and write some small benchmark programs for testing performance?

starblue 2009-11-25 14:37:29

I don't understand OP's assertion that the porting is untestable. I suspect that OP means that the current implementation is part-hardware and would have to be recoded in C (or whatever) before becoming testable on a new platform. I stick with my original belief that no amount of arguing from first principles will reveal maximum speeds; OP will have to port and test.

High Performance Mark 2009-11-25 15:03:40
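
As an illustration of the kind of small benchmark suggested in this thread, here is a minimal sketch that estimates memory copy throughput on a candidate machine. The function name `measure_copy_gbps` and its parameters are hypothetical and do not come from the posts above. It assumes the buffers are chosen much larger than the last-level cache and that `clock()` is precise enough for the run length; treat the result as a rough indication, not a calibrated measurement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: copy `bytes` bytes `repeats` times and return the
   approximate memory traffic in GB/s (each pass reads and writes `bytes`).
   Returns 0.0 if allocation fails or the elapsed time is too short to measure. */
double measure_copy_gbps(size_t bytes, int repeats)
{
    assert(bytes > 0 && repeats > 0);

    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return 0.0;
    }

    /* Touch every page up front so page faults are not part of the timing. */
    memset(src, 1, bytes);
    memset(dst, 0, bytes);

    volatile unsigned char sink = 0;
    clock_t t0 = clock();
    for (int r = 0; r < repeats; ++r) {
        src[0] = (unsigned char)r;   /* make each pass depend on r */
        memcpy(dst, src, bytes);
        sink = dst[bytes - 1];       /* read the result so the copy is observable */
    }
    clock_t t1 = clock();
    (void)sink;

    free(src);
    free(dst);

    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    if (secs <= 0.0)
        return 0.0;
    return 2.0 * (double)bytes * (double)repeats / secs / 1e9;
}
```

Calling, for example, `measure_copy_gbps((size_t)256 << 20, 10)` and comparing the result against the data rate your algorithm needs gives a first hint as to whether memory or arithmetic will be the bottleneck.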

Answer 3

+3 A:

First off, know that it will most likely not be possible for your code both to run as fast as possible on modern vector FPUs and to be completely portable across architectures. It is possible to abstract away some aspects of the architectures via macros, etc., but compilers are (at present) capable of generating nearly optimal auto-vectorized code only for very simple programs.

Now, on to your questions: current x86 hardware does not have a multiply-accumulate instruction, but it is capable of one vector add and one vector multiply per cycle per core. Assuming that your code achieves full computational density, and that you either hand-write vector code or your code is simple enough for the compiler to handle the task, the peak throughput that can be achieved independent of memory access latency is:

number of cores * cycles per second * flops per cycle * vector width

Which in your case sounds like:

4 * 3.2 GHz * 2 vector flops/cycle * 4 floats/vector = 102.4 Gflops

If you are going to write scalar code, divide that by four. If you are going to write vector code in C with some level of portable abstraction, plan on leaving some performance on the table, but you can certainly go substantially faster than scalar code will allow. 50% of theoretical peak is a conservative guess (I would expect to do better, assuming the algorithms are amenable to vectorization, but make sure you have some headroom in your estimates).
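
The arithmetic above can be reproduced with a few lines of C. This is only a sketch: the function names are hypothetical, the 4-core 3.2 GHz figures are the ones used in the formula above, and the 50% planning factor is the conservative guess from the text, not a measured value.

```c
#include <assert.h>
#include <stdio.h>

/* Theoretical peak in Gflops, ignoring memory latency and bandwidth:
   cores * clock (GHz) * vector ops per cycle * floats per vector. */
double peak_gflops(int cores, double clock_ghz,
                   int vector_ops_per_cycle, int floats_per_vector)
{
    assert(cores > 0 && clock_ghz > 0.0);
    assert(vector_ops_per_cycle > 0 && floats_per_vector > 0);
    return cores * clock_ghz * vector_ops_per_cycle * floats_per_vector;
}

/* Print the vector peak, the scalar peak (vector width 1), and a
   planning figure at 50% of the vector peak. */
void print_estimates(int cores, double clock_ghz)
{
    double vec    = peak_gflops(cores, clock_ghz, 2, 4);
    double scalar = peak_gflops(cores, clock_ghz, 2, 1);
    printf("vector peak:   %.1f Gflops\n", vec);
    printf("scalar peak:   %.1f Gflops\n", scalar);
    printf("planning (50%%): %.1f Gflops\n", 0.5 * vec);
}
```

Calling `print_estimates(4, 3.2)` reports the 102.4 Gflops vector peak, the 25.6 Gflops scalar peak, and a 51.2 Gflops planning figure.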

edit: notes on DPPS:

DPPS is not a multiply-add, and using it as one is a performance hazard on current architectures. Looking it up in the Intel Optimization Manual, you will find that it has a latency of 11 cycles, and that its throughput is only one vector result every two cycles. DPPS does up to four multiplies and three adds, so you're getting 2 multiplies and 1.5 adds per cycle, whereas using MULPS and ADDPS would get you 4 of each every cycle.

More generally, horizontal vector operations should be avoided unless absolutely necessary; lay out your data so that your operations stay within vector lanes to the maximum extent possible.

In fairness to Intel, if you can't change your data layout, and DPPS happens to be exactly the operation that you need, then you want to use it. Just be aware that you're limiting yourself to less than 50% of peak FP throughput right off the bat by doing so.
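
To illustrate what "staying within vector lanes" can look like, here is a minimal sketch that computes four 4-component dot products at a time using only MULPS and ADDPS (via the `_mm_mul_ps` and `_mm_add_ps` intrinsics), with no horizontal operation. It assumes you are free to store the data as structure-of-arrays; the `vec4_soa` type and the `dot4_soa` function are hypothetical names, not part of any library. A plain scalar loop handles the tail and serves as the fallback when SSE is not available.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical structure-of-arrays view: component i of vector k is
   stored at x[k], y[k], z[k], w[k]. */
typedef struct {
    const float *x, *y, *z, *w;
} vec4_soa;

/* out[k] = a.x[k]*b.x[k] + a.y[k]*b.y[k] + a.z[k]*b.z[k] + a.w[k]*b.w[k] */
void dot4_soa(vec4_soa a, vec4_soa b, float *out, size_t n)
{
    assert(out != NULL || n == 0);
    size_t i = 0;
#if defined(__SSE__)
    /* Each iteration produces four dot products with four vertical
       multiplies and three vertical adds; every lane works on its own
       vector, so nothing crosses lanes. */
    for (; i + 4 <= n; i += 4) {
        __m128 acc = _mm_mul_ps(_mm_loadu_ps(a.x + i), _mm_loadu_ps(b.x + i));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a.y + i), _mm_loadu_ps(b.y + i)));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a.z + i), _mm_loadu_ps(b.z + i)));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a.w + i), _mm_loadu_ps(b.w + i)));
        _mm_storeu_ps(out + i, acc);
    }
#endif
    /* Scalar tail (and full fallback on targets without SSE). */
    for (; i < n; ++i)
        out[i] = a.x[i] * b.x[i] + a.y[i] * b.y[i]
               + a.z[i] * b.z[i] + a.w[i] * b.w[i];
}
```

The unaligned loads are used only to keep the sketch free of alignment requirements; with 16-byte-aligned arrays, the aligned load and store intrinsics would be the more usual choice.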

Stephen Canon 2009-11-25 16:03:06

Can you please point me to where I can find information about the latency of different operations on Intel CPUs?

psihodelia 2009-11-27 09:55:41

Intel publishes the information in the "Intel Optimization Manual", which is a free download from http://www.intel.com/products/processor/manuals/

Stephen Canon 2009-11-27 15:47:34

Answer 4

A:

Some points you should consider:

1) Intel's i7 architecture is at the moment your fastest option for 1 or 2 CPUs. Only with 4 or more sockets can AMD's Opterons compete.

2) Intel's compilers generate code that is often significantly faster than code generated by other compilers (when the code is used on AMD's CPUs, you have to patch away some CPU checks that Intel puts in to prevent AMD from looking good).

3) No x86 CPU supports multiply-and-add yet; AMD's next architecture, "Bulldozer", will probably be the first to support it.

4) You get high memory bandwidth on any AMD CPU, and on Intel only with the new i7 architecture (socket 1366 is better than 775).

5) Use Intel's highly efficient libraries if possible.

Whoever 2009-11-26 14:19:24
