tags:
views: 1251
answers: 5

Newer ARM processors include the PLD and PLI instructions.

I'm writing tight inner loops (in C++) which have a non-sequential memory access pattern, but a pattern that my code naturally understands in full. I would anticipate a substantial speedup if I could prefetch the next location while processing the current one, and this seems quick enough to try out that the experiment would be worth it!

I'm using ARM's new, expensive compilers, and they don't seem to emit PLD instructions anywhere, let alone in this particular loop that I care about.

How can I include explicit prefetch instructions in my C++ code?

+3  A: 

There should be some compiler-specific feature for this; there is no standard way to do it in C/C++. Check your compiler's Compiler Reference Guide. For the RealView compiler, see this or this.

+1  A: 

If you are trying to extract truly maximum performance from these loops, then I would recommend writing the entire looping construct in assembler. Depending on the data structures involved in your loop, you should be able to use inline assembly. Better still if you can unroll any part of the loop (such as the parts that make the access non-sequential).

Loren Charnley
A: 

At the risk of asking the obvious: have you verified the compiler's target architecture? For example (humor me), if by default the compiler is targeted to ARM7, you're never going to see the PLD instruction.

Dan
A: 

C++ is pretty inefficient; if you want performance, you shouldn't be using C++ in the first place. Switching languages and fiddling with compiler optimizations can do wonders for performance (you probably know this, but in general developers are surprised to find the same source code can run several times faster just by knowing how to use the compiler). If you are using ARM's tools then you are on the right track; they at least used to be superior to the alternatives at any price.

Fiddling with any language higher-level than assembler to make the compiler emit the assembly you want will eventually fail. Compile once and disassemble; use that code as a baseline, or write your own from scratch, and keep that routine in assembler for the duration. Having a few compilers at your disposal makes this even better, as each one is likely different. With the occasional differently skilled guru behind each one, you sometimes find assembler gems when comparing the compilers' output for the same code.

If the instruction is missing or not supported by the assembler then just add it yourself:

  mov r0,blah
  add r1,blah
  .word 0xopcode   @ emit the raw machine word for the missing instruction
  mov r2,foo
  mvn r3,bar

Profiling or timing your code will tell you what is faster and what isn't. It is difficult at best to hand-tune without timing your code; there is always a gotcha that you didn't account for in your estimates.

dwelch
A: 

It is not outside the realm of possibility that other optimizations, like software pipelining and loop unrolling, may achieve the same effect as your prefetching idea (hiding the latency of the loads by overlapping it with useful computation), but without the extra instruction-cache pressure caused by the added instructions. I would even go so far as to say that this is the case more often than not for tight inner loops, which tend to have few instructions and little control flow. Is your compiler doing these types of traditional optimizations instead? If so, it may be worth looking at the pipeline diagram to develop a more detailed cost model of how your processor works, and evaluating more quantitatively whether prefetching would help.

Matt J