views:

615

answers:

11

In certain areas of development, such as game development and real-time systems, it is important to have a fast, optimized program. On the other hand, modern compilers already do a lot of optimization, and optimizing in Assembly can be time-consuming in a world where deadlines are a factor to take into consideration.

Questions:

  1. Is optimizing certain functions with Assembly in a C/C++ program really worth it?

  2. Is there really a sufficient gain in performance when optimizing a C/C++ program with Assembly with today's modern compilers?


From the answers posted, what I understand is that any gain that can be made is important in certain areas such as embedded systems and multimedia programming (graphics, sound, etc.). Also, one needs to be capable (or have someone capable) of doing a better job in Assembly than a modern compiler. Writing really optimized C/C++ can take less time and do a good enough job. One last thing: learning Assembly can help you understand the inner mechanics of a program and make you a better programmer in the end.

+9  A: 

The only possible answer to that is: yes, if there is a performance gain that is relevant and useful.

The question, I guess, should really be: can you get a meaningful performance gain by using assembly language in a C/C++ program?

The answer is yes.

The cases where you get a meaningful increase in performance have probably diminished over the last 10-20 years as libraries and compilers have improved, but for an architecture like x86 in particular, hand optimization in certain applications (particularly graphics-related ones) can still pay off.

But like anything don't optimize until you need to.

I would argue that algorithm optimization and writing highly efficient C (in particular) will create far more of a performance gain for less time spent than rewriting in assembly language in the vast majority of cases.

cletus
x64 architecture being an extension to x86 architecture, are there similar gains?
Partial
Not so much, if the compiler is competent. x86 compilers are often limited to a generic subset. Hand-crafting assembly can be beneficial if the availability of things like SSE is known. x64 compilers have the advantage that SSE2 is guaranteed to be available, and can include that in the "generic" model, allowing more optimal assembly to be produced. There are limits, of course, but it's something you'd have to profile.
greyfade
The benefit of hand coding comes down to x86 having variable-length instructions and, more importantly, instructions that take a variable number of cycles. In optimizing, the programmer may know things that will determine the actual number of cycles that the compiler can't or won't figure out and, as such, make decisions unavailable to the compiler that can improve performance. It's fairly marginal most of the time, however.
cletus
+1 for "only possible answer is yes". The question says is there a gain for "certain functions". Of course there exists at least one function which at least one compiler does badly on. End of answer, start of wider discussion...
Steve Jessop
+6  A: 

The difficulty is: can you do a better job of optimizing than the compiler can, given the architecture of modern CPUs? If you are designing for a simple CPU (such as for embedded systems), then you may do reasonable optimizations, but for a pipelined architecture the optimization is much harder, as you need to understand how the pipelining works.

So, given that, if you can do this optimization, and you are working on something that the profiler tells you is too slow, and it is a part that should be as fast as possible, then yes, optimizing makes sense.

James Black
You just need that Assembly expert ;)
Partial
Because they are so cheap. :) And bored silly sitting around waiting for work to come by.
James Black
+5  A: 

Maybe

It completely depends on the individual program

You need a profile, obtained with a profiling tool, before you know. Some programs spend all their time waiting for a database, or simply don't have their runtime concentrated in a small area. Without such a hot spot, assembly doesn't help much.

There is a rule of thumb that 90% of the runtime happens in 10% of the code. You really want one very intense bottleneck, and not every program has that.

Also, the machines are so fast now that some of the low-hanging fruit has been eaten, so to speak, by the compilers and CPU cores. For example, say you write way better code than the compiler and cut the instruction count in half. Even then, if you end up doing the same number of memory references, and they are the bottleneck, you may not win.

Of course, you could start preloading registers in previous loop iterations, but the compiler is likely to already be trying that.

Learning assembly is really more important as a way to comprehend what the machine really is, rather than as a way to beat the compiler. But give it a try!

DigitalRoss
+4  A: 

There is one area where assembly optimisation is still regularly performed: embedded software. These processors are usually not very powerful and have many architectural quirks that may not be exploited by the compiler for optimisation. That said, it should still only be done for particularly tight areas of code, and it has to be very well documented.

sybreon
+1  A: 

I would say that for most people and most applications, it's not worth it. Compilers are very good at optimising precisely for the architecture they're being compiled for.

That's not to say that optimising in assembly is never warranted. A lot of math-heavy and low-level intensive code is often optimised by using specific CPU instructions such as SSE* etc. to improve on the compiler's generated instruction/register use. In the end, the human knows precisely the point of the program. The compiler can only assume so much.

If you're not at the level where you know your own assembly will be faster, then I would let the compiler do the hard work.

Nick Bedford
+3  A: 

I'll assume you've profiled your code, and you've found a small loop which is taking up most of the time.

First, try recompiling with more aggressive compiler optimizations, and then re-profile. If you're already running with all compiler optimizations turned on and you still need more performance, then I recommend looking at the generated assembly.

What I typically do after looking at the assembly code for the function is see how I can change the C code to get the compiler to write better assembly. The advantage of doing it this way is that I end up with code which is tuned to run with my compiler on my processor, but is portable to other environments.

brianegge
Is the generated Assembly the same as the compiled version?
Partial
Yes, the generated assembly is the same as the compiled object file, except that it is immensely easier to understand as you can arrange for the matching source lines to be interspersed as assembly comments. Getting a decent disassembler to do that starting from the object file is not easy.
RBerteig
Alternately, you can take the compiler-generated code and try to tweak it for greater performance, then leave it in assembly if you have to.
David Thornley
+4  A: 

For your typical small shop developer writing an App, the performance gain/effort trade-off almost never justifies writing assembly. Even in situations where assembly can double the speed of some bottleneck, the effort is often not justifiable. In a larger company, it might be justifiable if you're the "performance guy".

However, for a library writer, even small improvements for large effort are often justified, because it saves time for thousands of developers and users who use the library in the end. Even more so for compiler writers. If you can get a 10% efficiency win in a core system library function, that can literally save millennia (or more) of battery life spread across your user base.

Stephen Canon
+1: if you are that type of assembly programmer, and someone asks 'what's your carbon footprint', you can answer '-30,000 or so'.
soru
+20  A: 

I'd say it's not worth it. I work on software that does real-time 3D rendering (i.e., rendering without assistance from a GPU). I do make extensive use of SSE compiler intrinsics -- lots of ugly code filled with _mm_add_ps() and friends -- but I haven't needed to recode a function in assembly in a very long time.

My experience is that good modern optimizing compilers are pretty darn effective at intricate, micro-level optimizations. They'll do sophisticated loop transformations such as reordering, unrolling, pipelining, blocking, tiling, jamming, fission, and the like. They'll schedule instructions to keep the pipeline filled, vectorize simple loops, and deploy some interesting bit twiddling hacks. Modern compilers are incredibly fascinating beasts.

Can you beat them? Well, sure, given that they choose the optimizations to use by heuristics, they're bound to get it wrong sometimes. But I've found it's much better to optimize the code itself by looking at the bigger picture. Am I laying out my data structures in the most cache-friendly way? Am I doing something unorthodox that misleads the compiler? Can I rewrite something a bit to give the compiler better hints? Am I better off recomputing something instead of storing it? Could inserting a prefetch help? Have I got false cache sharing somewhere? Are there small code optimizations that the compiler thinks unsafe but that are okay here (e.g., converting division to multiplication by the reciprocal)?

I like to work with the compiler instead of against it. Let it take care of the micro-level optimizations, so that you can focus on the mezzo-level optimizations. The important thing is to have a good idea how your compiler works so that you know where the boundaries between the two levels are.

Boojum
Very interesting point of view!
Partial
+1  A: 

Don't forget that by rewriting in assembly you lose portability. Today you don't care, but tomorrow your customers might want your software on another platform, and then those assembly snippets will really hurt.

sharptooth
The smart way to rewrite in assembly is to leave multiple implementations, one of which is the plain C, which you continue to functionally test. Then when you come to port, you have something which works on the new platform, and which may or may not require platform-specific optimisation before it's fit to ship. If you're lucky, then the new platform is fast enough not to require it (either because you deliberately min-specced it that way, or through the passage of time, or because the only reason you had to write assembler in the first place was one dog-slow platform).
Steve Jessop
It will work, but you will have to maintain two versions of the same code.
sharptooth
Of course. Possibly more, if you've written assembly for more than one CPU (variant). But since you can always fall back to the portable implementation, you make that decision on a per-platform basis. If the speedup on that platform is worth maintaining the assembly, then you maintain it. If it's not worth it, you don't. So regardless of whether it "really hurts", if it's worth it then obviously you do it anyway, minimising the hurt by keeping your exit open.
Steve Jessop
"Possibly more, if you're written assembly for more than one CPU (variant)". For instance, I worked on a project which for good reasons implemented memcpy. We had maybe 10-20 different assembly implementations, more than one of them for different ARM variants. Fortunately, you don't change the defined behaviour of memcpy very often. So the cost of maintenance was pretty low, and a new platform could just use the portable implementation to start with, and rewrite once it was up and running.
Steve Jessop
+1  A: 

Definitely yes!

Here is a demonstration of a CRC-32 calculation which I wrote in C++, then optimized in x86 assembler using Visual Studio.

InitCRC32Table() should be called at program start. CalcCRC32() will calculate the CRC for a given memory block. Both functions are implemented in both assembler and C++.

On a typical Pentium machine, you will notice that the assembler CalcCRC32() function is 50% faster than the C++ code.

The assembler implementation is not MMX or SSE, but simple x86 code. The compiler will never produce code that is as efficient as manually crafted assembler code.

    DWORD* panCRC32Table = NULL; // CRC-32 CCITT 0x04C11DB7

    void DoneCRCTables()
    {
        if (panCRC32Table )
        {
            delete[] panCRC32Table;
            panCRC32Table= NULL;
        }
    }

    void InitCRC32Table()
    {
        if (panCRC32Table) return;
        panCRC32Table= new DWORD[256];

        atexit(DoneCRCTables);

    /*
        for (int bx=0; bx<256; bx++)
        {
            DWORD eax= bx;
            for (int cx=8; cx>0; cx--)
                if (eax & 1)
                    eax= (eax>>1) ^ 0xEDB88320;
                else
                    eax= (eax>>1)             ;
            panCRC32Table[bx]= eax;
        }
    */
            _asm cld
            _asm mov    edi, panCRC32Table
            _asm xor    ebx, ebx
        p0: _asm mov    eax, ebx
            _asm mov    ecx, 8
        p1: _asm shr    eax, 1
            _asm jnc    p2
            _asm xor    eax, 0xEDB88320           // bit-swapped 0x04C11DB7
        p2: _asm loop   p1
            _asm stosd
            _asm inc    bl
            _asm jnz    p0
    }


/*
DWORD inline CalcCRC32(UINT nLen, const BYTE* cBuf, DWORD nInitVal= 0)
{
    DWORD crc= ~nInitVal;
    for (DWORD n=0; n<nLen; n++)
        crc= (crc>>8) ^ panCRC32Table[(crc & 0xFF) ^ cBuf[n]];
    return ~crc;
}
*/
DWORD inline __declspec (naked) __fastcall CalcCRC32(UINT        nLen       ,
                                                     const BYTE* cBuf       ,
                                                     DWORD       nInitVal= 0 ) // used to calc CRC of chained bufs
{
        _asm mov    eax, [esp+4]         // param3: nInitVal
        _asm jecxz  p2                   // __fastcall param1 ecx: nLen
        _asm not    eax
        _asm push   esi
        _asm push   ebp
        _asm mov    esi, edx             // __fastcall param2 edx: cBuf
        _asm xor    edx, edx
        _asm mov    ebp, panCRC32Table
        _asm cld

    p1: _asm mov    dl , al
        _asm shr    eax, 8
        _asm xor    dl , [esi]
        _asm xor    eax, [ebp+edx*4]
        _asm inc    esi
        _asm loop   p1

        _asm pop    ebp
        _asm pop    esi
        _asm not    eax
    p2: _asm ret    4                    // eax- returned value. 4 because there is 1 param in stack
}

// test code:

#include "mmSystem.h"                      // timeGetTime
#pragma comment(lib, "Winmm.lib" )

InitCRC32Table();

BYTE* x= new BYTE[1000000];
for (int i= 0; i<1000000; i++) x[i]= 0;

DWORD d1= ::timeGetTime();

for (i= 0; i<1000; i++)
    CalcCRC32(1000000, x, 0);

DWORD d2= ::timeGetTime();

TRACE("%d\n", d2-d1);
Lior Kogan
The compiler will often produce code more efficient than hand-crafted assembly code. It just won't happen all the time.
David Thornley
The whole idea of 'crafting' is to produce more efficient code than the compiler...
Lior Kogan
A truly skilled assembly programmer will *always* produce code at least as good as the compiler can. Usually, it will be better. I have never seen a non-trivial computational routine for which it was impossible to improve on the compiler's output. Sometimes the gains are small; sometimes they are huge. Of course, one should always file lots of bugs against the compiler to help make the compiler better as well, and to understand *why* the compiler doesn't use an optimization that is "obvious" to someone who knows every detail of the ISA.
Stephen Canon
+1  A: 

Good answers. I would say "Yes" IF you have already done performance tuning like this, and you are now in the position of

  1. KNOWING (not guessing) that some particular hot-spot is taking more than 30% of your time,

  2. seeing just what assembly language the compiler generated for it, after all attempts to make it generate optimal code,

  3. knowing how to improve on that assembler code,

  4. being willing to give up some portability.

Compilers do not know everything you know, so they are defensive and cannot take advantage of what you know.

As one example, they write subroutine entry and exit code in a general way that works no matter what the subroutine contains. You, on the other hand, may be able to hand-code little routines that dispense with frame pointers, saving registers, and stuff like that. You're risking bugs, but it is possible to beat the compiler.

Mike Dunlavey