Out of order execution in CPUs means that a CPU can reorder instructions to gain better performance and it means the CPU is having to do some very nifty bookkeeping and such. There are other processor approaches too, such as hyper-threading.
Some fancy compilers understand the (un)interrelatedness of instructions to a limited extent, and will automatically interleave instruction flows (probably over a longer window than the CPU sees) to better utilise the processor. Deliberate compile-time interleaving of floating and integer instructions is another example of this.
Now I have highly-parallel task. And I typically have an ageing single-core x86 processor without hyper-threading.
Is there a straight-forward way to get my the body of my 'for' loop for this highly-parallel task to be interleaved so that two (or more) iterations are being done together? (This is slightly different from 'loop unwinding' as I understand it.)
My task is a 'virtual machine' running through a set of instructions, which I'll really simplify for illustration as:
void run(int num) { for(int n=0; n<num; n++) { vm_t data(n); for(int i=0; i<data.len(); i++) { data.insn(i).parse(); data.insn(i).eval(); } } }
So the execution trail might look like this:
data(1) insn(0) parse data(1) insn(0) eval data(1) insn(1) parse ... data(2) insn(1) eval data(2) insn(2) parse data(2) insn(2) eval
Now, what I'd like is to be able to do two (or more) iterations explicitly in parallel:
data(1) insn(0) parse data(2) insn(0) parse \ processor can do OOO as these two flow in data(1) insn(0) eval / data(2) insn(0) eval \ OOO opportunity here too data(1) insn(1) parse / data(2) insn(1) parse
I know, from profiling, (e.g. using Callgrind with --simulate-cache=yes), that parsing is about random memory accesses (cache missing) and eval is about doing ops in registers and then writing results back. Each step is several thousand instructions long. So if I can intermingle the two steps for two iterations at once, the processor will hopefully have something to do whilst the cache misses of the parse step are occurring...
Is there some c++ template madness to get this kind of explicit parallelism generated?
Of course I can do the interleaving - and even staggering - myself in code, but it makes for much less readable code. And if I really want unreadable, I can go so far as assembler! But surely there is some pattern for this kind of thing?