C++ is pretty inefficient; if you want performance, you shouldn't be using C++ in the first place. Switching languages and fiddling with compiler optimizations can do wonders for performance. (You probably know this, but in general developers are surprised to find that the same source code can run several times faster just by knowing how to use the compiler.) If you are using ARM's tools then you are on the right track; they at least used to be superior to the alternatives at any price.
Fiddling with any language higher level than assembler to make the compiler produce the assembly you want will eventually fail. Compile once and disassemble. Use that code as a baseline, or write your own from scratch, and leave that routine in assembler for the duration. Having a few compilers at your disposal makes this even better, as each one is likely to generate different code. With the occasional differently skilled guru behind them, you sometimes find assembler gems when comparing the output of competing compilers for the same code.
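One way to set up such a baseline is to keep the hot routine behind a plain C interface, so that the compiler-generated version and a later hand-written assembler version are interchangeable. This is only a sketch: the function name sum_bytes and the file name in the comments are made up for illustration. The -O2, -S and -c flags are standard GCC/Clang options; other toolchains have their own equivalents.

```cpp
// hot_routine.cpp (hypothetical file name)
// Emit the assembly to use as a baseline:
//   g++ -O2 -S hot_routine.cpp      (writes hot_routine.s)
// or disassemble an object file:
//   g++ -O2 -c hot_routine.cpp && objdump -d hot_routine.o
#include <cstddef>
#include <cstdint>

// extern "C" keeps the symbol unmangled, so it is easy to find in the
// disassembly and can later be replaced by an assembler routine that
// exports the same name.
extern "C" std::uint32_t sum_bytes(const std::uint8_t* data, std::size_t len) {
    std::uint32_t total = 0;
    for (std::size_t i = 0; i < len; ++i) {
        total += data[i];
    }
    return total;
}
```

Running the same file through each compiler you have and diffing the resulting listings is the comparison described above.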
If the instruction is missing or not supported by the assembler then just add it yourself:
mov r0,blah
add r1,blah
.word 0xopcode
mov r2,foo
mvn r3,bar
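From C or C++ the same trick might look like the sketch below. It assumes GCC/Clang-style inline assembly (ARM's own compiler may use a different syntax) and uses 0xE1A00000, the ARM-mode encoding of mov r0, r0, purely as a harmless stand-in for the real opcode. The function name emit_raw_opcode is hypothetical. On anything other than 32-bit ARM in ARM mode the guard compiles the instruction out.

```cpp
// Returns true if the raw opcode was emitted, false if it was compiled out.
bool emit_raw_opcode() {
#if defined(__GNUC__) && defined(__arm__) && !defined(__thumb__)
    // 0xE1A00000 encodes "mov r0, r0" (a no-op) in ARM mode; substitute
    // the encoding of the instruction the assembler does not know.
    __asm__ volatile(".word 0xE1A00000");
    return true;
#else
    return false;  // not 32-bit ARM in ARM mode: nothing emitted
#endif
}
```

A real instruction that reads or writes registers needs those registers pinned or listed as clobbers, which is one more reason to keep such a routine in a separate assembler file.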
Profiling or timing your code will tell you what is faster and what isn't. It is difficult at best to hand-tune without timing your code; there is always a gotcha that you didn't factor into your estimates.
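A minimal timing harness could look something like this. It assumes a hosted environment where std::chrono is available; on bare metal a hardware cycle counter would take its place. The helper name time_per_call_ns is made up for this sketch.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: runs fn() `iterations` times and returns the
// average wall-clock time per call in nanoseconds.
template <typename Fn>
double time_per_call_ns(Fn fn, std::uint32_t iterations) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (std::uint32_t i = 0; i < iterations; ++i) {
        fn();
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

Make sure the code under test has a visible side effect, such as a write to a volatile variable, or the optimizer may remove it and you end up timing an empty loop. Run each candidate many times and compare the averages rather than trusting a single run.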