Hi all,
How much faster is the following assembler code:
shl ax, 1
Versus the following C code:
num = num * 2;
How can I even find out?
If you have a decent compiler, it will produce the same or similar code. The best way to find out is to disassemble the output and check the generated code.
Put them in a loop with a counter that goes so high that it runs for at least a second in the fastest case. Use your favorite timing mechanism to see how long each takes.
The assembly case should be done with inline assembly in the same C program as you use for the pure C test. Otherwise, you're not comparing apples to apples.
By the way, I think you should add a third test:
num <<= 1;
The question then is whether that does the same thing as the assembly version.
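For what it's worth, the loop-and-timer approach described above could be sketched roughly like this in plain C. The names `op_fn`, `mul_by_two`, `shift_by_one` and `time_op` are made up for illustration, and the inline-assembly variant is left out because its syntax depends on the compiler. Note that calling through a function pointer adds overhead that will likely dwarf the single instruction being measured, so treat this as a sketch of the method rather than a trustworthy benchmark.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical names for illustration; none of this is from the posts above. */
typedef uint32_t (*op_fn)(uint32_t);

uint32_t mul_by_two(uint32_t n)   { return n * 2; }
uint32_t shift_by_one(uint32_t n) { return n << 1; }

/* Applies `op` to its own result `iterations` times and returns the
   elapsed CPU time in seconds, as reported by clock().
   The final value is stored through the volatile `sink` pointer so the
   compiler cannot throw the whole loop away as dead code. */
double time_op(op_fn op, uint32_t seed, unsigned long iterations,
               volatile uint32_t *sink)
{
    uint32_t n = seed;
    clock_t start = clock();
    for (unsigned long i = 0; i < iterations; i++) {
        n = op(n);
    }
    clock_t end = clock();
    *sink = n;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

You would call `time_op(mul_by_two, ...)` and `time_op(shift_by_one, ...)` with the same seed and an iteration count large enough that the fastest case runs for at least a second, then compare the two results.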
In most circumstances, it won't make a difference. Multiplication is fast on nearly all modern hardware. In particular, it is usually fast enough that unless you have meticulously hand-optimized code, the pipeline will hide the entirety of the latency and you will see no speed difference at all between the two cases.
You may be able to measure a performance difference on multiplies and shifts when you execute them in isolation, but there will typically not be any difference in the context of the rest of your compiled code. (As I noted, this may not hold true if the code is meticulously optimized).
Now, that said, shifts are still generally faster than multiplies, and almost any reasonable compiler will map a fixed power-of-two multiply into a shift, anyway (assuming that the semantics are actually equivalent on the target architecture).
Edit: one more thing you may want to try if you really care about this is x+x. I know of at least one architecture on which this can actually be faster than shifting, depending on the surrounding context.
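On the "are the semantics actually equivalent" point, here is a small sketch that checks the three forms against each other. It assumes an unsigned 16-bit value (chosen to mirror the 16-bit ax register in the question), and the function names are hypothetical. For unsigned types all three forms wrap around the same way. Signed types are a different story, since signed overflow and left-shifting a negative value are undefined behaviour in C.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if n * 2, n << 1 and n + n give the same 16-bit result.
   The arithmetic is done in `unsigned` so wraparound is well defined,
   then truncated back to 16 bits. */
bool doubling_forms_agree(uint16_t n)
{
    uint16_t by_mul   = (uint16_t)((unsigned)n * 2u);
    uint16_t by_shift = (uint16_t)((unsigned)n << 1);
    uint16_t by_add   = (uint16_t)((unsigned)n + (unsigned)n);
    return by_mul == by_shift && by_shift == by_add;
}

/* Exhaustively checks every possible 16-bit input. */
bool doubling_forms_agree_everywhere(void)
{
    for (uint32_t n = 0; n <= UINT16_MAX; n++) {
        if (!doubling_forms_agree((uint16_t)n)) {
            return false;
        }
    }
    return true;
}
```

Since the results are identical for unsigned values, a compiler is free to pick whichever instruction it considers cheapest on the target.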
If, for your target platform, shifting left is the quickest way to multiply a number by two, then the chances are your compiler will do that when compiling the code. Look at the disassembly to check.
So, for that one line, it's probably exactly the same speed. However, as you're unlikely to have a function containing just that one line, you might well find the compiler would defer the shift until the value is used, or otherwise mix it up with surrounding code, making it less clear-cut. A good optimizing compiler will generally do a good job of beating poor to average hand-written assembly.
If you are using GCC, ask to see the generated assembly with option -S. You may find it's the same as your assembler instruction.
To answer the original question: on out-of-order processors, instruction speed is measured by throughput and latency, and you would measure both using the rdtsc assembly instruction. But someone else has already done it for a lot of processors, so you don't need to bother. PDF
Your assembly variant might be faster, might be slower. What made you think that it is necessarily faster?
On the x86 platform, there are quite a few ways to multiply something by 2. I would expect a compiler to do add ax, ax, which is intuitively more efficient than your shl because it doesn't involve a potentially stored constant ('1' in your case).
Also, for quite a long time, on the x86 platform the preferred way of multiplying things by constants was not a shift, but rather a lea operation (when possible). In the above example that would be lea eax, [eax*2]. (Multiplication by 3 would be done through lea eax, [eax*2+eax].)
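If you want to see what your own compiler picks for the multiply-by-3 case, a minimal file like the following can be compiled with something like gcc -O2 -S and inspected. The function names are hypothetical, and whether you actually get a lea, a shift plus add, or a plain multiply depends on the compiler, its version and the target, so no particular output is promised here.

```c
#include <stdint.h>

/* Written the "obvious" way; the compiler chooses the instruction(s). */
uint32_t times_three(uint32_t n)
{
    return n * 3;
}

/* Written as shift-and-add by hand; same result for unsigned values,
   and an optimizing compiler may well emit the same code for both. */
uint32_t times_three_by_hand(uint32_t n)
{
    return (n << 1) + n;
}
```

Comparing the assembly generated for the two functions is usually more informative than guessing which source form is "faster".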
The belief in shift operations being somehow "faster" is a nice old story for newbies, which has virtually no relevance today. And, as usual, most of the time your compiler (if it is up-to-date) has much better knowledge about the underlying hardware platform than people with naive love for shift operations.
Is this, by any chance, an academic question? I assume you understand it is in the general category of "getting a haircut to lose weight".
If the current up-to-date compiler (VC9) were really doing a good job, it would outperform VC6 by a wide margin, and that doesn't happen. This is why I even prefer to use VC6 for some code, which runs faster than the same code compiled with MinGW at -O3 or with VC9 at /Ox.