Code speed is mostly determined by low-level details of the computer architecture, both in the CPU itself and in the memory system around it.
There are a lot of factors in code speed, and they're usually low-level concerns handled automatically by the compiler, but knowing what's going on underneath can help you write faster code.
First of all, obviously, word size. 64-bit machines have a bigger word size (and yes, bigger usually means better here), so most operations can be carried out faster; for example, double-precision operations (where double means 64 bits, i.e. 2 * 32) fit in a single word. A 64-bit architecture also benefits from a wider data bus, which provides faster data transfer rates.
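To make that concrete, here's a minimal sketch (C#, purely illustrative) of the work a 32-bit machine has to do to add two 64-bit values, something a 64-bit machine does in a single instruction:

    // On a 64-bit machine this is one add instruction:
    ulong a = 0x12345678_9ABCDEF0, b = 0x0FEDCBA9_87654321;
    ulong sum = a + b;

    // A 32-bit machine must emulate it: two 32-bit adds plus manual carry.
    uint aLow = (uint)a, aHigh = (uint)(a >> 32);
    uint bLow = (uint)b, bHigh = (uint)(b >> 32);
    uint lowSum = aLow + bLow;                 // may wrap around
    uint carry = lowSum < aLow ? 1u : 0u;      // wrap-around means a carry was produced
    uint highSum = aHigh + bHigh + carry;
    ulong emulated = ((ulong)highSum << 32) | lowSum;
    // sum == emulated, but it took roughly twice the work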
Second, pipelining is also important. Instruction execution can be broken into stages, or phases, so that, for example, instructions are usually divided into:
- Fetch: The instruction is read from the instruction cache.
- Decode: The instruction is decoded and interpreted to see what has to be done.
- Execute: The instruction is executed (usually meaning operations are carried out in the ALU).
- Memory access: If the instruction has to access memory (for example, to load a value from the data cache into a register), that happens here.
- Writeback: The values are written back to the destination register.
Now, the pipeline lets the processor overlap instructions across those stages and work on them simultaneously, so that while it's executing one instruction, it's also decoding the next one and fetching the one after that.
Some instructions have dependencies. If I'm adding two registers together, the execute stage of the add needs both values to have already been recovered from memory. By knowing the pipeline structure, the compiler can reorder the assembly instructions to put enough "distance" between the loads and the add so that the CPU doesn't have to wait.
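You can see the cost of a dependency chain even from high-level code. In this sketch (illustrative C#), every add depends on the previous one through sum, so each iteration's execute stage has to wait for the one before it to finish:

    // One accumulator: iteration k cannot start its add until
    // iteration k-1 has written `sum`, so the adds form a serial chain.
    static long Sum(long[] data)
    {
        long sum = 0;
        for (int k = 0; k < data.Length; k++)
            sum += data[k];
        return sum;
    }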
Another CPU optimization is superscalar execution, which makes use of redundant execution units (multiple ALUs, for example) so that two add instructions can be performed simultaneously. Again, by knowing the architecture exactly you can order instructions to take advantage of this. For example, if the compiler detects that no dependencies exist in the code, it can rearrange loads and arithmetic, delaying the arithmetic to a later point where all the data is available, and then perform four operations at the same time.
These tricks are mostly applied by compilers, though.
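That said, you can sometimes hand the hardware the parallelism yourself. A sketch (illustrative C#; the name SumUnrolled is made up): splitting the sum from the previous sketch across four independent accumulators removes the serial chain, so a superscalar CPU can issue the four adds in parallel.

    // Four independent accumulators: within an iteration no add depends
    // on another, so redundant ALUs can execute them simultaneously.
    static long SumUnrolled(long[] data)
    {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int k = 0;
        for (; k + 3 < data.Length; k += 4)
        {
            s0 += data[k];
            s1 += data[k + 1];
            s2 += data[k + 2];
            s3 += data[k + 3];
        }
        for (; k < data.Length; k++)    // leftover elements
            s0 += data[k];
        return s0 + s1 + s2 + s3;
    }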
What can be of use when designing your application, and what can really improve code speed, is knowing the cache organization and its policies. The classic example is a loop that walks a two-dimensional array in the wrong order:
// Make two arrays; in memory each is laid out as 1,000,000 contiguous bytes
byte[,] array1 = new byte[1000, 1000];
byte[,] array2 = new byte[1000, 1000];
// Add the array items, walking down the columns
for (int j = 0; j < 1000; j++)
    for (int i = 0; i < 1000; i++)
        array1[i, j] = (byte)(array1[i, j] + array2[i, j]);
Let's see what's happening here.
array1[0,0] is brought into the cache. Since the cache works in blocks, you get the first 1000 bytes of the array, so the cache holds array1[0,0] through array1[0,999].
array2[0,0] is brought into the cache. Again, blocks: you now have array2[0,0] through array2[0,999].
On the next step we access array1[1,0], which is not in the cache, and neither is array2[1,0], so both are brought from memory into the cache. Now, if we suppose the cache is very small, this forces array2[0,0..999] out of the cache... and so on, row after row. So when we finally get around to accessing array2[0,1], it is no longer in the cache. The cache ends up useless for both array1 and array2.
If we reorder the memory accesses:
for (int i = 0; i < 1000; i++)
    for (int j = 0; j < 1000; j++)
        array1[i, j] = (byte)(array1[i, j] + array2[i, j]);
Then every block brought into the cache is fully used before it has to be evicted, and the program runs considerably faster.
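If you want to check this yourself, here's a rough benchmark sketch (timings will vary by machine, and since these 1 MB arrays may fit in a large last-level cache, the gap can be smaller than on the tiny cache assumed above):

    using System;
    using System.Diagnostics;

    class CacheOrderDemo
    {
        const int N = 1000;

        static void Main()
        {
            byte[,] array1 = new byte[N, N];
            byte[,] array2 = new byte[N, N];

            var sw = Stopwatch.StartNew();
            for (int j = 0; j < N; j++)        // column order: jumps N bytes per access
                for (int i = 0; i < N; i++)
                    array1[i, j] = (byte)(array1[i, j] + array2[i, j]);
            Console.WriteLine($"Column order: {sw.ElapsedMilliseconds} ms");

            sw.Restart();
            for (int i = 0; i < N; i++)        // row order: walks memory contiguously
                for (int j = 0; j < N; j++)
                    array1[i, j] = (byte)(array1[i, j] + array2[i, j]);
            Console.WriteLine($"Row order: {sw.ElapsedMilliseconds} ms");
        }
    }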
These are all naive, academic examples; if you really want or need to exploit computer architecture, you need very deep knowledge of the specifics of your target, and again that is mostly useful when writing compilers. Nonetheless, a basic knowledge of caches and low-level CPU behavior can help you improve your code's speed.
For example, such knowledge can be of extreme value in cryptographic programming, where you have to handle very big numbers (as in 1024 bits), so choosing the correct representation can speed up the math that has to be carried out underneath...
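As a sketch of what "correct representation" means here (illustrative only; real crypto libraries do much more): a 1024-bit number stored as sixteen 64-bit limbs needs half as many add-with-carry steps as the same number stored in 32-bit words.

    // 1024-bit addition over sixteen 64-bit limbs, least significant first.
    // With a 32-bit word size this same loop would need 32 iterations.
    static ulong[] Add1024(ulong[] x, ulong[] y)
    {
        var result = new ulong[16];
        ulong carry = 0;
        for (int k = 0; k < 16; k++)
        {
            ulong sum = x[k] + y[k];               // may wrap around
            ulong carryOut = sum < x[k] ? 1ul : 0ul;
            result[k] = sum + carry;
            if (result[k] < sum) carryOut = 1;     // adding the carry-in wrapped too
            carry = carryOut;
        }
        return result;  // final carry discarded: arithmetic mod 2^1024
    }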