I'm having a hard time beating my compiler using inline assembly.

What are some good, non-contrived examples of functions that the compiler has a hard time making really, really fast and simple, but that are relatively simple to write in inline assembly?

A: 

My best win over a compiler was on a simple memcpy routine... I skipped a lot of the basic setup stuff (e.g., I didn't need much of a stack frame, so I saved a few cycles there), and did a few pretty hairy things.

That was about 6 years ago, with some proprietary compiler of unknown quality. I'll have to dig up the code I had and try it against GCC now; I don't know that it could get any faster, but I wouldn't rule it out.

In the end, even though my memcpy was on average about 15x faster than the one in our C library, I just kept it in my back pocket in case I needed it. It was a toy for me to play with PPC assembly, and the speed boost wasn't necessary in our application.
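
The general shape of such a routine is a word-at-a-time, unrolled copy. Here's a rough C sketch of that shape (assuming word-aligned, word-copyable buffers; this is illustrative only, not the original PPC assembly):

#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch: copy 32-bit words, unrolled 4x, with a byte tail.
   Hand-tuned memcpy routines build on this shape before getting hairy. */
void *memcpy32(void *dst, const void *src, size_t n)
{
    uint32_t *d = dst;
    const uint32_t *s = src;
    while (n >= 16) {            /* four words (16 bytes) per iteration */
        d[0] = s[0];
        d[1] = s[1];
        d[2] = s[2];
        d[3] = s[3];
        d += 4; s += 4; n -= 16;
    }
    char *db = (char *)d;        /* finish any remaining bytes */
    const char *sb = (const char *)s;
    while (n--)
        *db++ = *sb++;
    return dst;
}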

Chris Arguin
+2  A: 

If you want to do stuff like SIMD operations, you might be able to beat a compiler. This will require good knowledge of the architecture and the instruction set though.

samoz
You really can't overstate the importance of understanding the architecture and instruction set when dealing with assembly. I typically avoid asm, but I still make it a point to learn the capabilities of the architecture so I can have some idea of the theoretical performance available.
NoMoreZealots
+7  A: 

If you don't consider SIMD operations cheating, you can usually write SIMD assembly that performs much better than your compiler's autovectorization abilities (if it even has autovectorization!).

Here's a very basic SSE (one of x86's SIMD instruction sets) tutorial. It's for Visual C++ inline assembly.

Edit: Here's a small pair of functions if you want to try it for yourself. They compute an n-length dot product. One uses SSE2 instructions inline (GCC inline syntax); the other is very basic C.

It's very, very simple, and I'd be very surprised if a good compiler couldn't vectorize the simple C loop, but if it doesn't, you should see a speed-up in the SSE2 version. The SSE2 version could probably be faster if I used more registers, but I don't want to stretch my very weak SSE skills :).

float dot_asm(float *a, float *b, int n)
{
  float ans = 0;
  int i;
  // Assumes n is a multiple of 8; no tail handling for other sizes.
  while (n > 0) {
    float tmp[4] __attribute__ ((aligned(16)));

    // All the SSE work lives in one asm block so the compiler can't
    // clobber xmm0 between the accumulate and the store.
    __asm__ __volatile__(
           "xorps      %%xmm0, %%xmm0\n\t"   /* zero the accumulator */
           "movups     (%0), %%xmm1\n\t"     /* load a[0..3] */
           "movups     16(%0), %%xmm2\n\t"   /* load a[4..7] */
           "movups     (%1), %%xmm3\n\t"     /* load b[0..3] */
           "movups     16(%1), %%xmm4\n\t"   /* load b[4..7] */
           "add        $32,%0\n\t"           /* advance a by 8 floats */
           "add        $32,%1\n\t"           /* advance b by 8 floats */
           "mulps      %%xmm3, %%xmm1\n\t"   /* a[0..3] * b[0..3] */
           "mulps      %%xmm4, %%xmm2\n\t"   /* a[4..7] * b[4..7] */
           "addps      %%xmm2, %%xmm1\n\t"
           "addps      %%xmm1, %%xmm0\n\t"
           "movaps     %%xmm0, %2"           /* spill 4 partial sums */
           :"+r" (a), "+r" (b), "=m" (tmp)
           :
           :"xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "memory");

   // Horizontal sum of the four partial results.
   for (i = 0; i < 4; i++) {
      ans += tmp[i];
   }
   n -= 8;
  }
  return ans;
}

float dot_c(float *a, float *b, int n) {
  float ans = 0;
  int i;
  for (i = 0; i < n; i++) {
    ans += a[i] * b[i];
  }
  return ans;
}
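
A quick way to compare the two is a small test driver like this (hypothetical; n = 8 satisfies dot_asm's multiple-of-8 assumption):

#include <stdio.h>

/* Hypothetical driver: compares the two dot products above on 8 elements. */
int main(void)
{
  float a[8], b[8];
  int i;
  for (i = 0; i < 8; i++) {
    a[i] = (float)i;   /* 0, 1, ..., 7 */
    b[i] = 2.0f;
  }
  /* Expected result: 2*(0+1+...+7) = 56 */
  printf("asm: %f  c: %f\n", dot_asm(a, b, 8), dot_c(a, b, 8));
  return 0;
}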
Falaina
SIMD is definitely not cheating. It provides a clear case of where compilers haven't kept up with hardware. C doesn't handle instruction-level parallelism well. Maybe the compiler can unroll loops here and there, but more advanced routines need serious tweaking.
NoMoreZealots
There are plenty of compilers that will output SIMD instructions.
jrockway
They will, for limited cases; basically, as long as your code is written with a common technique or algorithm. Once the instruction set grows too big, optimal use of many instructions starts to get lost in the wash when writing a compiler or optimizer, simply due to complexity. This was a large part of the basis for the "RISC" processor concept. Optimization is similar to chess: a computer can beat the majority of people, but it takes much more than a desktop to beat a grand master.
NoMoreZealots
+5  A: 

Unless you are an assembly guru, the odds of beating the compiler are very low.

A fragment from the above link:

For example, the bit-oriented "XOR %EAX, %EAX" instruction was the fastest way to set a register to zero in the early generations of the x86, but most code is generated by compilers, and compilers rarely generated the XOR instruction. So the IA designers decided to move the frequently occurring compiler-generated instructions up to the front of the combinational decode logic, making the literal "MOVL $0, %EAX" instruction execute faster than the XOR instruction.
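
Both idioms are easy to try as GCC-style inline assembly (an illustrative snippet, not from the quoted article):

/* Illustrative only: the two x86 zeroing idioms discussed above. */
void zero_eax_both_ways(void)
{
  __asm__ __volatile__("xorl %%eax, %%eax" ::: "eax");  /* bit-oriented idiom */
  __asm__ __volatile__("movl $0, %%eax" ::: "eax");     /* literal move */
}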

Nick D
I'm not an assembly guru, and I've beaten the compiler. I very rarely resort to assembly; it was a last resort when I had to. This just seems like nay-saying, and it ignores his question. He admits it isn't easy in the question.
NoMoreZealots
I didn't say it's impossible. If you grok the instruction set, you can try to write faster code or to squeeze the routine into fewer instructions. If you have a not very sophisticated compiler, or the compiler doesn't handle the SSE or 3DNow! sets, writing assembly might be the *proper* way to implement some routines.
Nick D
You are right: understanding the instruction set is an absolute necessity if you want to have any hope of beating a compiler. But even with a good compiler, you can find instructions that don't have C constructs that map well to them on modern architectures. There are still "gaps" in the abstractions that are only growing larger as the multicore paradigm becomes the norm. And in today's power-conscious, mobile-driven market, we can't assume ever-faster CPU core speeds for our applications. CPUs hit 1 GHz in 1999, and new apps today run on the "hottest" hardware clocking at 400 MHz.
NoMoreZealots
By "Hottest" hardware, I mean things like the IPhone, and what not. Battery life makes the tradeoffs between efficiency and development time tilt in a completely new direction.
NoMoreZealots
Pete, I don't argue. And these *gaps* are another example of the *leaky abstraction* notion :), http://en.wikipedia.org/wiki/Leaky_abstraction
Nick D
When it comes to how languages map to hardware, the issue is more fundamental than general software development. As much as I like C and C++, very few languages capture the notion of parallelism well. I would really like to see more computer science influence and cross-over in hardware description languages such as VHDL. They do parallelism well, but suck at abstractions. Cross-pollination would help both sides of the fence.
NoMoreZealots
+4  A: 

I implemented a simple cross-correlation using a generic "straight C" implementation. And THEN, when it took longer than the timeslice I had available, I resorted to explicit parallelization of the algorithm and used processor intrinsics to force specific instructions to be used in the calculations. For this particular case, the computation time was reduced from >30ms to just over 4ms. I had a 15ms window to complete processing before the next data acquisition occurred.

This was a SIMD-type optimization on a VLIW processor. It only required 4 or so processor intrinsics, which are basically assembly language instructions that give the appearance of a function call in the source code. You could do the same with inline assembly, but the syntax and register management is a little nicer with processor intrinsics.
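
The DSP's intrinsics aren't named here; as an analogy, the same pattern with x86 SSE intrinsics looks like this (a sketch, not the original code):

#include <xmmintrin.h>  /* SSE intrinsics */

/* Multiply-accumulate four float pairs at once. Each intrinsic looks
   like a function call but maps to a single SIMD instruction. */
static inline __m128 madd4(__m128 acc, __m128 x, __m128 y)
{
  return _mm_add_ps(acc, _mm_mul_ps(x, y));
}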

Other than that, if size matters, assembler is king. I went to school with a guy who wrote a full-screen text editor in less than 512 bytes.

NoMoreZealots
This is a classic case where assembler is sensible. The code was written in C; it worked, but not fast enough. Recoding in assembler made it work fast enough - that was a good reason to drop into assembler.
Jonathan Leffler
I was disappointed with the performance I got out of the straight C version; the chip vendor's propaganda bragged about how good their C compiler was. And their most recent toolchain doesn't do any better job optimizing it either. Unfortunately, VLIW DSPs aren't easy to write an optimizer for.
NoMoreZealots
+3  A: 

I have a checksum algorithm which requires words to be rotated by a certain number of bits. To implement it, I've got this macro:

//rotate word n right by b bits
#define ROR16(n,b) (((n)>>(b))|(((n)<<(16-(b)))&0xFFFF))

//... and inside the inner loop: 
sum ^= ROR16(val, pos);

A Visual Studio release build expands this to the following (val is in ax, pos is in dx, sum is in bx):

mov         ecx,10h 
sub         ecx,edx 
mov         ebp,eax 
shl         ebp,cl 
mov         cx,dx 
sar         ax,cl 
add         esi,2 
or          bp,ax 
xor         bx,bp

The more efficient equivalent hand-generated assembly would be:

 mov       cl,dl
 ror       ax,cl
 xor       bx,ax

I haven't figured out how to emit the ror instruction from pure C code. However...
While writing this up, I remembered compiler intrinsics. I can generate the second set of instructions with:

sum ^= _rotr16(val,pos);

So my answer is: even if you think you can beat the pure C compiler, check the intrinsics before resorting to inline assembly.
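
A minimal side-by-side check of the macro and the intrinsic (assuming MSVC, where _rotr16 lives in <intrin.h>):

#include <stdio.h>
#ifdef _MSC_VER
#include <intrin.h>   /* _rotr16 */
#endif

#define ROR16(n,b) (((n)>>(b))|(((n)<<(16-(b)))&0xFFFF))

int main(void)
{
  unsigned short val = 0x1234;
  printf("macro:     %04x\n", ROR16(val, 4));   /* prints 4123 */
#ifdef _MSC_VER
  printf("intrinsic: %04x\n", _rotr16(val, 4)); /* should match */
#endif
  return 0;
}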

AShelly
Nice concrete example.
NoMoreZealots
I tried this in gcc (4.0.1) with -O4. It does output a ROR instruction for a 32-bit rotate, but not for 16 bits.
finnw
+3  A: 
pps