views: 563

answers: 8
Hi all,

How much faster is the following assembler code:

shl ax, 1

Versus the following C code:

num = num * 2;

How can I even find out?

+3  A: 

If you have a decent compiler it will produce the same or similar code. The best way to find out is to disassemble and check the generated code.
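For example (file and function names here are just placeholders), you could put the line into a tiny function, compile it with optimization, and look at what comes out:

    /* mul2.c -- a hypothetical test file.  Compile with something like
       "gcc -O2 -S mul2.c" to get mul2.s, or "gcc -O2 -c mul2.c" followed by
       "objdump -d mul2.o", and see which instruction the compiler chose. */
    unsigned int times_two(unsigned int num)
    {
        return num * 2;
    }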

steve
Agreed. If the code is significantly different, *then* go through the hassle of benchmarking and detecting statistical significance.
Paul Nathan
+1  A: 

Put them in a loop with a counter that goes so high that it runs for at least a second in the fastest case. Use your favorite timing mechanism to see how long each takes.

The assembly case should be done with inline assembly in the same C program as you use for the pure C test. Otherwise, you're not comparing apples to apples.

By the way, I think you should add a third test:

num <<= 1;

The question then is whether that does the same thing as the assembly version.
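A rough sketch of that kind of harness is below. It assumes a GCC-style compiler (the inline assembly uses GCC's AT&T syntax rather than the MASM-style syntax in the question), and the iteration count and volatile accumulator are only there to keep the optimizer from deleting the loops; treat the printed numbers as ballpark figures, not a rigorous benchmark.

    #include <stdio.h>
    #include <time.h>

    #define ITERATIONS 100000000UL   /* arbitrary; pick something that runs ~1 second */

    static unsigned short run_mul(void)
    {
        volatile unsigned short num = 1;          /* volatile keeps the loop alive */
        for (unsigned long i = 0; i < ITERATIONS; i++)
            num = num * 2;
        return num;
    }

    static unsigned short run_shift(void)
    {
        volatile unsigned short num = 1;
        for (unsigned long i = 0; i < ITERATIONS; i++)
            num <<= 1;
        return num;
    }

    static unsigned short run_asm(void)
    {
        volatile unsigned short num = 1;
        for (unsigned long i = 0; i < ITERATIONS; i++) {
            unsigned short tmp = num;
            __asm__ volatile ("shlw $1, %0" : "+r" (tmp));   /* shl reg16, 1 */
            num = tmp;
        }
        return num;
    }

    static void time_one(const char *name, unsigned short (*fn)(void))
    {
        clock_t start = clock();
        (void)fn();
        printf("%-6s %.3f s\n", name, (double)(clock() - start) / CLOCKS_PER_SEC);
    }

    int main(void)
    {
        time_one("mul",   run_mul);
        time_one("shift", run_shift);
        time_one("asm",   run_asm);
        return 0;
    }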

Warren Young
+4  A: 

In most circumstances, it won't make a difference. Multiplication is fast on nearly all modern hardware. In particular, it is usually fast enough that unless you have meticulously hand-optimized code, the pipeline will hide the entirety of the latency and you will see no speed difference at all between the two cases.

You may be able to measure a performance difference on multiplies and shifts when you execute them in isolation, but there will typically not be any difference in the context of the rest of your compiled code. (As I noted, this may not hold true if the code is meticulously optimized).

Now, that said, shifts are still generally faster than multiplies, and almost any reasonable compiler will map a fixed power-of-two multiply into a shift, anyway (assuming that the semantics are actually equivalent on the target architecture).

Edit: one more thing you may want to try if you really care about this is x+x. I know of at least one architecture on which this can actually be faster than shifting, depending on the surrounding context.
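For what it's worth, the forms under discussion are easy to line up in C (function names are just placeholders); compiling each with optimization and diffing the generated assembly shows what your particular compiler actually picks:

    /* Compile with e.g. "gcc -O2 -S" and compare the output for each variant. */
    unsigned int by_mul(unsigned int x)   { return x * 2;  }
    unsigned int by_shift(unsigned int x) { return x << 1; }
    unsigned int by_add(unsigned int x)   { return x + x;  }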

Stephen Canon
That's interesting. Do you remember the processor on which x+x is faster than x<<1, and if that's an ia-32 / amd64, whether it was compiled as `add` or `lea`?
Pascal Cuoq
Not an x86 processor. An embedded processor where the shifter is running one cycle behind the adder, and there was a one-cycle stall when the result of an add is consumed by the shifter.
Stephen Canon
Lots of processors have more functional units that can do an add than functional units that can do a shift, so even if the latencies are the same it can help to use the add instead (possibly better ILP). That said, you'll be super lucky to notice any difference.
Keith Randall
Actually, shifts were sometimes more costly than adds on x86. Back in the days of the Pentium, shifts and rotates could only run in one of the two pipelines, so if you tried to schedule two independent shifts in the same clock cycle, one of them would stall. Addition did not have this limitation. See Michael Abrash's Black Book or Zen of Code Optimization for further details.
Adisak
+1  A: 

If, for your target platform, shifting left is the quickest way to multiply a number by two, then the chances are your compiler will do that when compiling the code. Look at the disassembly to check.

So, for that one line, it's probably exactly the same speed. However, as you're unlikely to have a function containing just that one line, you might well find the compiler defers the shift until the value is used, or otherwise mixes it up with the surrounding code, making it less clear cut. A good optimizing compiler will generally do a good job of beating poor to average hand-written assembly.
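As a small, hypothetical illustration of that "mixed up with the surrounding code" point: when the doubled value is only used as an array index, many x86 compilers fold the multiply into a scaled addressing mode, so there is no separate shift instruction left to measure.

    /* At -O2 on x86-64 this typically becomes a single load, roughly
       "movl (%rdi,%rsi,8), %eax" -- the *2 has vanished into the address. */
    int element(const int *table, long i)
    {
        return table[i * 2];
    }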

James Sutherland
+4  A: 

If you are using GCC, ask it to show you the generated assembly with the -S option. You may find it's the same as your assembler instruction.

To answer the original question: on out-of-order processors, instruction speed is measured by throughput and latency, and you would measure both using the rdtsc assembly instruction. But someone else has already done it for you for a lot of processors, so you don't need to bother. PDF
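If you do want to poke at it yourself, here is a minimal rdtsc sketch, assuming an x86 target and a GCC-compatible compiler (__rdtsc() comes from x86intrin.h); serious measurements need serialization, warm-up, and many repetitions, all of which are omitted here:

    #include <stdio.h>
    #include <x86intrin.h>                    /* __rdtsc() on GCC/Clang, x86 only */

    int main(void)
    {
        enum { N = 1000000 };
        volatile unsigned short num = 1;      /* volatile keeps the loop from being removed */

        unsigned long long start = __rdtsc();
        for (int i = 0; i < N; i++)
            num = num * 2;                    /* swap in num <<= 1; to compare */
        unsigned long long cycles = __rdtsc() - start;

        printf("~%.2f cycles per iteration\n", (double)cycles / N);
        return 0;
    }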

Pascal Cuoq
+20  A: 

Your assembly variant might be faster, might be slower. What made you think that it is necessarily faster?

On the x86 platform, there are quite a few ways to multiply something by 2. I would expect a compiler to do add ax, ax, which is intuitively more efficient than your shl because it doesn't involve a potentially stored constant ('1' in your case).

Also, for quite a long time on the x86 platform, the preferred way of multiplying by constants was not a shift but rather a lea operation (when possible). In the above example that would be lea eax, [eax*2]. (Multiplication by 3 would be done with lea eax, [eax*2+eax].)
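To see that for yourself, a rough sketch like the following is enough (hypothetical function names; the real output varies by compiler, flags, and target), since a typical x86-64 compiler at -O2 turns these into lea or add instructions rather than mul or shl:

    /* Typical (not guaranteed) x86-64 output at -O2:
         times2:  lea eax, [rdi+rdi]        ; or add/shl, depending on the compiler
         times3:  lea eax, [rdi+rdi*2]
    */
    int times2(int x) { return x * 2; }
    int times3(int x) { return x * 3; }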

The belief that shift operations are somehow "faster" is a nice old story for newbies, with virtually no relevance today. And, as usual, most of the time your compiler (if it is up to date) knows far more about the underlying hardware platform than people with a naive love for shift operations.

AndreyT
Nice answer, thanks!
Kyle Rozendo
And just to add to the mix: if num is stored in the ax register, the shift might be exactly what the compiler generates; if num is not stored in the ax register, then the assembly doesn't do the same job as the C code; you have to get the value into the right register, do the shift, and store the result again.
Jonathan Leffler
It depends not only on how the compiler implements it but also on whether the variable is global or local and how it's used. For example, if you are using the value only temporarily as an array index, the compiler may use the addressing mode to do the computation and avoid generating a separate instruction at all.
Adisak
+4  A: 

Is this, by any chance, an academic question? I assume you understand it is in the general category of "getting a haircut to lose weight".

Mike Dunlavey
Only saw this now, hehe. It was a pure academic "interest" question, absolutely no real life relevance.
Kyle Rozendo
A: 

If the up-to-date compiler (VC9) were really doing a good job, it would outperform VC6 by a wide margin, and that doesn't happen. This is why I still prefer VC6 for some code, which runs faster than the same code compiled with MinGW at -O3 or with VC9 at /Ox.

Arabcoder