Longpoke, there is just one limitation: time. When you don't have the resources to hand-optimize every single change to the code, allocating registers yourself, optimizing a few spills away and whatnot, the compiler will win every single time. You make your modification to the code, recompile and measure. Repeat if necessary.
Also, you can do a lot on the high-level side. Inspecting the resulting assembly may give the IMPRESSION that the code is crap, but in practice it may run faster than what you thought would be quicker. Example:
int y = data[i];
// do some stuff here..
call_function(y, ...);
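The compiler will read the data, push it to the stack (spill), and later read it back from the stack and pass it as an argument. Sounds shite? It might actually be very effective latency compensation: the load is issued early, so a possible cache miss overlaps with the work in between, and the result can be a faster runtime.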
// optimized version
call_function(data[i], ...); // not so optimized after all..
The idea with the optimized version was that we reduce register pressure and avoid the spill. But in truth, the "shitty" version was faster!
Looking at the assembly, counting the instructions and concluding "more instructions, slower" would be a misjudgment.
The thing to pay attention to here is: many assembly experts think they know a lot, but know very little. The rules change from one architecture to the next, too. There is no silver-bullet x86 code, for example, which is always the fastest. These days it is better to go by rules of thumb:
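To make the "recompile and measure" step concrete, here is a minimal sketch of timing both variants instead of judging them by instruction count. Everything in it is an assumption for illustration: call_function is a hypothetical stand-in, the "stuff" in between is a dummy operation, and time_ns is a helper name I made up. An optimizing compiler may well emit identical code for both, which is itself worth finding out.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for call_function(); not from the original post.
static long call_function(int y, long acc) {
    return acc + y;
}

// Variant A: load early and keep the value live across the other work.
long load_early(const std::vector<int>& data) {
    long acc = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        int y = data[i];               // load issued first
        acc ^= static_cast<long>(i);   // placeholder for "do some stuff here.."
        acc = call_function(y, acc);
    }
    return acc;
}

// Variant B: load at the call site.
long load_late(const std::vector<int>& data) {
    long acc = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        acc ^= static_cast<long>(i);   // same placeholder work
        acc = call_function(data[i], acc);
    }
    return acc;
}

// Times a single call of fn(data) in nanoseconds with a monotonic clock.
// The computed value is returned through 'result' so the work stays observable.
template <typename Fn>
long long time_ns(Fn fn, const std::vector<int>& data, long& result) {
    auto start = std::chrono::steady_clock::now();
    result = fn(data);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

Call time_ns(load_early, data, r) and time_ns(load_late, data, r) several times on realistic data and compare the distributions; a single run tells you very little.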
- memory is slow
- cache is fast
- try to use the cache better
- how often are you going to miss? do you have a latency compensation strategy?
- you can execute 10-100 ALU/FPU/SSE instructions in the time of one single cache miss
- application architecture is important..
- .. but it doesn't help when the problem isn't in the architecture
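A small sketch of the "use the cache better" point, assuming a matrix stored row-major in a flat vector (the function names are mine, for illustration only). Both functions compute the same sum with the same number of additions; the only difference is the order in which memory is touched. On large matrices the strided version typically misses the cache far more often, but how big the gap is depends on the hardware, so measure it.

```cpp
#include <cstddef>
#include <vector>

// Walks the row-major matrix in storage order: consecutive addresses,
// so each cache line that is fetched gets fully used.
long long sum_row_order(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    long long total = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

// Walks the same matrix column by column: each access jumps 'cols'
// elements ahead, so for wide rows nearly every access lands on a
// different cache line.
long long sum_column_order(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    long long total = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}
```

Same instruction count, same result, potentially very different runtime, which is the reason counting instructions in the assembly listing is not enough.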
Also, trusting the compiler too much to magically transform poorly thought-out C/C++ code into "theoretically optimum" code is wishful thinking. You have to know the compiler and toolchain you use if you care about "performance" at this low level.
C/C++ compilers are generally not very good at re-ordering sub-expressions because, for starters, functions can have side effects. Functional languages don't suffer from this caveat but don't fit the current ecosystem that well. There are compiler options for relaxed floating-point precision rules, which allow the compiler/linker/code generator to change the order of operations.
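A quick illustration of why the compiler is not allowed to regroup floating-point operations by default: addition of doubles is not associative. The function names below are made up for the example. With strict IEEE semantics the two groupings give different answers for the same inputs; with relaxed options (for example -ffast-math on GCC and Clang) the compiler is free to pick either.

```cpp
// Groups the sum as (a + b) + c, exactly as written.
double sum_left(double a, double b, double c) {
    return (a + b) + c;
}

// Groups the sum as a + (b + c). Mathematically the same,
// but the rounding happens at a different point.
double sum_right(double a, double b, double c) {
    return a + (b + c);
}

// With a = 1e20, b = -1e20, c = 1.0 and strict floating-point rules:
//   sum_left  -> (1e20 - 1e20) + 1.0 = 1.0
//   sum_right -> 1e20 + (-1e20 + 1.0) = 0.0, because 1.0 is lost
//                when added to -1e20 (far below its precision).
```

So turning on the relaxed options buys the compiler freedom to reorder, at the price of results that may change from one build to the next.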
This topic is a bit of a dead-end; for most it's not relevant, and the rest know what they are doing already anyway.
It all boils down to this: "to understand what you are doing" is a bit different from knowing what you are doing.