When the function is inlined, typically no copying of variables is directly involved with the call. Variables will still be moved around and sometimes put on the stack as a normal part of execution, but not as a direct result of the function call. (When you run out of registers, some values may get spilled to the stack, but only if needed.) So the overhead of the "call" basically disappears when a function is inlined: no more setting up and tearing down the stack frame, no more unconditional jump to the function and back, no more pushing and popping parameters.
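As a rough illustration of the point above, here is a minimal sketch using the GCC/Clang `always_inline` attribute. The names `clamp_to_byte` and `sum_clamped` are made up for this example and do not come from the question.

```c
/* Hypothetical helper, used only for illustration. */
static inline __attribute__((always_inline))
int clamp_to_byte(int x)
{
    if (x < 0)   return 0;
    if (x > 255) return 255;
    return x;
}

/* Once clamp_to_byte is inlined, its two comparisons become part of the
 * loop body itself: there is no call instruction, no stack frame for the
 * helper, and no argument passing for x. */
int sum_clamped(const int *values, int count)
{
    int total = 0;
    for (int i = 0; i < count; ++i)
        total += clamp_to_byte(values[i]);
    return total;
}
```

Comparing the disassembly of `sum_clamped` with and without the attribute (and with optimization on) is the easiest way to see what your compiler actually does with it.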
If you can rely on your always_inline attribute to always inline the function, then you should also not pass the Vector by pointer (if it isn't modified). The reason is that passing it by pointer requires the vector's address to be taken, which means the compiler must ensure that it has an address, and thus it cannot exist only in CPU registers. This can slow things down if it isn't needed: once you take the address of something, the compiler generally has to give it a memory location unless it can prove the address is never actually needed.
Because of the pass-by-pointer, this code will, at least conceptually, have an instruction to get the object's address and at least one dereference to get at a member's value, and the compiler has to work harder to remove them. If you pass by value, this MAY still happen, but the compiler has a better chance of optimizing all of that away.
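To make the two options concrete, here is a minimal sketch of the same function written both ways. The `Vector` layout and the function names are assumptions made for illustration, since the real type isn't shown here; the attribute syntax is the GCC/Clang one.

```c
/* Hypothetical Vector type; the real one may differ. */
typedef struct {
    float x, y, z;
} Vector;

/* Pass by pointer: the caller has to produce an address for its Vector,
 * and every member access goes through that pointer. */
static inline __attribute__((always_inline))
float length_squared_ptr(const Vector *v)
{
    return v->x * v->x + v->y * v->y + v->z * v->z;
}

/* Pass by value: no address is taken, so after inlining the three
 * floats are free to stay in registers. */
static inline __attribute__((always_inline))
float length_squared_val(Vector v)
{
    return v.x * v.x + v.y * v.y + v.z * v.z;
}
```

At the call site the difference is just `length_squared_ptr(&v)` versus `length_squared_val(v)`. Whether it matters in practice depends on your compiler and optimization level, so check the generated assembly before committing to one form.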
Don't forget that overuse of inlining can significantly increase the size of the compiled binary. In certain cases, large code segments (as a result of inline functions) can cause more instruction cache misses, which will result in slower performance: the CPU constantly has to go out to main memory to fetch parts of your program because the hot code is too big to fit in the small L1 cache. This may be especially important on embedded processors (like the one in the iPhone) because these processors typically have small caches.