views:

191

answers:

5

Hi all,

I am just wondering if there are any useful tools out there that allow me to exploit the instruction-level parallelism in some algorithms. More specifically, I have a subset of algorithms from the multimedia domain, and I wonder what the best way is to exploit ILP in these algorithms. All of these algorithms are implemented in C, so ideally I would give them as input to some tool and it would tell me which instructions could be executed in parallel.

Many thanks for any pointers!

Robert

+8  A: 

The problem is that deciding whether an instruction will be executed in parallel is quite difficult, considering how many different processor types there are. A good understanding of the CPU architecture you are targeting will give you a good starting point for this sort of work. No software will beat a human mind with the right knowledge.

In general, though, so much work is done by the compiler and by mechanisms like out-of-order execution engines that all of this is abstracted away from you as much as possible. You will find that even with a full understanding of it, you are unlikely to get more than a few percent speed improvement.

If you want to see serious speed improvements, you are far better off rewriting the algorithm to take advantage of multiple processors and of the available SIMD operations. SIMD alone can yield serious speedups, and this is especially true for many "multimedia algorithms" that can process multiple elements of the data simultaneously.
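For instance, a minimal sketch of the SIMD approach using SSE intrinsics (the function `scale_add` and its setup are my own invention; it assumes an x86 CPU with SSE and that `n` is a multiple of 4):

```c
#include <xmmintrin.h>  /* SSE intrinsics, available on any x86-64 compiler */

/* dst[i] += k * src[i], processing four floats per instruction.
   Assumes n is a multiple of 4; unaligned loads/stores are used so
   the buffers need no special alignment. */
void scale_add(float *dst, const float *src, float k, int n)
{
    __m128 vk = _mm_set1_ps(k);               /* broadcast k into all 4 lanes */
    for (int i = 0; i < n; i += 4) {
        __m128 s = _mm_loadu_ps(src + i);     /* load 4 source elements */
        __m128 d = _mm_loadu_ps(dst + i);     /* load 4 destination elements */
        d = _mm_add_ps(d, _mm_mul_ps(s, vk)); /* 4 multiply-adds at once */
        _mm_storeu_ps(dst + i, d);
    }
}
```

The scalar version would do one multiply-add per iteration; this does four, which is exactly the kind of win the multimedia codecs get.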

Goz
Voted you down as your comment about software being unable to beat a human mind is completely wrong: you are confusing the issue with the ability to optimise code. In this case the hardware is the final judge of what can be executed in parallel and this information is encoded exactly into an algorithm that decides ILP.
Amoss
@Amoss: Even then I have to disagree with you. The hardware can reschedule instructions but it can't perform macro optimisations. Therein the human brain can win. Micro optimisations are helpful but don't give as much win as macro optimisations. ILP is irrelevant on this point. So the hardware can reorganise a few instructions? Big deal. A human that knows how those instructions can be reorganised can write faster code. As such I maintain that you are wrong in your assertion.
Goz
How on earth can ILP be irrelevant on this point? The question is how to decide what ILP is available.
Amoss
@Amoss: ok, I'll rephrase ... the hardware algorithm behind ILP is irrelevant to the point of whether software or the human mind is the better optimiser. The human mind is ABSOLUTELY the better optimiser. The hardware is rubbish and only works well in VERY specific circumstances. You downvoted me for saying that and I will continue to maintain you are talking utter crap. I'll re-iterate: "No software will beat a human mind with the right knowledge."
Goz
I have no issue with you claiming that a human can optimise code better than a compiler; although that question is still open I'll accept your claim. My point was quite simple: the question of whether or not instructions will be executed in parallel is decidable exactly by a machine, and has to be decided by the processor doing the execution. It is clearly different from the question of whether instructions have the potential to be executed in parallel. You know it would be quite easy to edit your wording to clear the point up and then I'll vote you back up. Much easier than writing insults.
Amoss
+1  A: 

Do you have reason to believe that the compiler is doing a poor job of uncovering ILP? If you work at the algorithm level, the focus should normally be on data parallelism and higher-order optimizations. Optimizing for ILP should be the very last step, and it is entirely tied to how the compiler works. In general, if you can eliminate false data dependencies, a decent compiler should do the rest for you.
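As an example of eliminating a false data dependency (the function `saxpy` here is just an illustrative name, not from any particular library): without `restrict`, the compiler must assume `dst` and `src` might overlap, which forces it to serialize the loads and stores; with `restrict` it is free to reorder and overlap them.

```c
/* C99 'restrict' promises the compiler that dst and src do not alias,
   removing the false dependency between the store to dst[i] and the
   load from src[i+1]. The compiler can then reorder and vectorize. */
void saxpy(float *restrict dst, const float *restrict src,
           float a, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] += a * src[i];
}
```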

Something like Acumem's SlowSpotter may be of help (unless you really need to hand-optimize for ILP, in which case I don't know of a good tool, unless your compiler can produce a good optimization report for you; IIRC the Cray and SGI MIPS compilers could produce reports like that).

Per Ekman
+3  A: 

If I read you correctly, you are not interested in SIMD or threads, just getting optimal ordering of normal CPU instructions.

The first thing to check is whether your compiler is targeting the correct CPU subtype. The compiler will usually reorder instructions to reduce dependencies between them, but it is vital for the compiler to know exactly which version of the CPU you are targeting. (In particular, older GCC sometimes fails to detect recent CPUs and then optimizes for the i386.)
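Concretely, with GCC or Clang you can tune for the machine you are compiling on and then inspect what target the compiler actually picked (the file name `algo.c` is just a placeholder):

```shell
# Generate code tuned for the CPU of the build machine:
gcc -O2 -march=native -c algo.c

# Ask GCC which -march/-mtune it resolved 'native' to:
gcc -O2 -march=native -Q --help=target | grep -E 'march|mtune'
```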

The second thing you can do is check your compiler's inlining decisions (by looking at the assembler output). Inlining small functions in algorithms can increase code size, but it gives the compiler more opportunity to optimize, since multiple independent calculations can then be done in parallel. I often resort to forced inlining.
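A sketch of forced inlining with the GCC/Clang attribute (the helper `sq` and the caller `sum_sq` are made-up names for illustration): once `sq` is inlined, the multiply for `v[i+1]` no longer has to wait behind a call/return for `v[i]`.

```c
/* Force the compiler to inline this small helper (GCC/Clang extension),
   so the multiplications from successive iterations can overlap. */
static inline __attribute__((always_inline))
float sq(float x) { return x * x; }

float sum_sq(const float *v, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += sq(v[i]);   /* inlined: no call overhead, no serialization */
    return s;
}
```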

Lastly, for Intel CPUs, Intel's own C++ compiler claims to be best at this. Intel also has the VTune profiler, which can specifically report how efficiently the ALUs are used in your program's hotspots.

jdv
+3  A: 

First, both the compiler and the CPU itself already aggressively reorder instructions to exploit ILP as well as possible. Most likely, they're doing a better job of it than you'd ever be able to.

However, there are a few areas where a human can aid the process.

The compiler is typically very conservative about reordering floating-point computations, because doing so might slightly change the result. For example, given this code:

float f, g, h, i;  /* assume these already hold values */
float j = f + g + h + i;

you'll likely get zero ILP because the code you've written is evaluated as ((f + g) + h) + i: the result of the first addition is used as an operand for the next, the result of which is used as an operand in the final addition. No two additions can execute in parallel.

If you instead write it as float j = (f + g) + (h + i), the CPU is able to execute f+g and h+i in parallel: the two sums do not depend on each other.

In general, the thing preventing ILP is dependencies. Sometimes they're direct dependencies between arithmetic instructions as above, and sometimes they're store/load dependencies.

Loads and stores take a long time to execute compared to in-register operations, and any operation that depends on them has to wait until the load/store has finished.

So storing data in temporaries, which the compiler can keep in registers, can sometimes be used to avoid memory accesses. Likewise, starting loads as early as possible helps, so that their latency does not block the following operations.
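To illustrate the temporary trick (both function names are made up): in the first version the compiler must reload `*p` on every iteration, because `out[i]` might alias it; caching it in a local removes that store/load dependency.

```c
/* Must reload *p each iteration: the store to out[i] could alias *p. */
void bias_naive(float *out, const float *p, int n)
{
    for (int i = 0; i < n; i++)
        out[i] += *p;          /* load of *p serialized behind the store */
}

/* Load once into a local; the compiler keeps it in a register and the
   loop body has no load at all on the critical path. Valid as long as
   out does not actually overlap p. */
void bias_cached(float *out, const float *p, int n)
{
    float b = *p;
    for (int i = 0; i < n; i++)
        out[i] += b;
}
```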

The best technique is really to look closely at your code, and work out the dependency chains. Each sequence of operations where each one depends on the result of the previous is a chain of dependencies that can never be executed in parallel. Can this chain be broken up in some way? Perhaps by storing a value in a temporary, or perhaps by recomputing a value instead of waiting for the cached version to be loaded from memory. Perhaps just by placing a few parentheses as in the original floating-point example.

When there are no dependencies, the CPU will schedule operations to execute in parallel. So all you need to do to exploit ILP is to break up long dependency chains.
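A classic instance of breaking a dependency chain (my own example, not from the question's code): summing with one accumulator forms a single chain where every addition waits for the previous one; splitting the sum over two independent accumulators halves the chain length, so the CPU can overlap the additions. (Note this changes the association order, which is exactly the floating-point reordering compilers refuse to do on their own.)

```c
/* One accumulator: each add depends on the previous result. */
float sum_one(const float *v, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Two independent accumulators: the two adds per iteration have no
   dependency on each other and can execute in parallel.
   Assumes n is even for simplicity. */
float sum_two(const float *v, int n)
{
    float s0 = 0.0f, s1 = 0.0f;
    for (int i = 0; i < n; i += 2) {
        s0 += v[i];
        s1 += v[i + 1];
    }
    return s0 + s1;
}
```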

Of course, that's easier said than done... :)

But if you spend some time with a profiler, and study the assembly output from the compiler, you can sometimes get an impressive speedup from manually optimizing your code to better exploit ILP.

jalf
A: 

The previous answers are good. In addition, there is much to learn at Intel's site, and if you have a budget, Intel's tools are worth looking at.
Intel's articles on Optimization

rleir