I'm running into an inconsistent optimization behavior with different compilers for the following code:

class tester
{
public:
    tester(int* arr_, int sz_)
        : arr(arr_), sz(sz_)
    {}

    int doadd()
    {
        sm = 0;
        for (int n = 0; n < 1000; ++n) 
        {
            for (int i = 0; i < sz; ++i)
            {
                sm += arr[i];
            }
        }
        return sm;
    }
protected:
    int* arr;
    int sz;
    int sm;
};

The doadd method simulates some intensive access to members (ignore the overflow in the addition for this question). Compare it with similar code implemented as a free function:

int arradd(int* arr, int sz)
{
    int sm = 0;
    for (int n = 0; n < 1000; ++n) 
    {
        for (int i = 0; i < sz; ++i)
        {
            sm += arr[i];
        }
    }
    return sm;
}

The doadd method runs about 1.5 times slower than the arradd function when compiled in Release mode with Visual C++ 2008. When I modify the doadd method to be as follows (aliasing all members with locals):

int doadd()
{
    int mysm = 0;
    int* myarr = arr;
    int mysz = sz;
    for (int n = 0; n < 1000; ++n) 
    {
        for (int i = 0; i < mysz; ++i)
        {
            mysm += myarr[i];
        }
    }
    sm = mysm;
    return sm;
}

Runtimes become roughly the same. Am I right in concluding that this is a missing optimization by the Visual C++ compiler? g++ seems to do it better and run both the member function and the normal function at the same speed when compiling with -O2 or -O3.


The benchmarking is done by invoking the doadd member and the arradd function on a sufficiently large array (a few million integers in size).


EDIT: Some fine-grained testing shows that the main culprit is the sm member. Replacing all others by local versions still makes the runtime long, but once I replace sm by mysm the runtime becomes equal to the function version.



Resolution

Disappointed with the answers (sorry guys), I shook off my laziness and dove into the disassembly listings for this code. My answer below summarizes the findings. In short: it has nothing to do with aliasing; it's all about loop unrolling, and some strange heuristics MSVC applies when deciding which loop to unroll.

+5  A: 

It may be an aliasing issue - the compiler cannot know that the instance variable sm will never be pointed at by arr, so it has to treat sm as if it were effectively volatile, and store it back to memory on every iteration. You could make sm a different type to test this hypothesis. Or just use a temporary local sum (which will get cached in a register) and assign it to sm at the end.
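A minimal sketch of the first suggestion (a hypothetical variant, not from the question): give `sm` a type that an `int*` cannot legally alias, such as `long long`, and see whether the compiler then keeps it in a register.

```cpp
// Hypothetical test of the aliasing theory: sm is now long long, so under
// strict-aliasing rules the int* arr cannot point at it, and an
// aliasing-aware compiler is free to accumulate it in a register.
class tester2
{
public:
    tester2(int* arr_, int sz_) : arr(arr_), sz(sz_), sm(0) {}

    long long doadd()
    {
        sm = 0;
        for (int n = 0; n < 1000; ++n)
            for (int i = 0; i < sz; ++i)
                sm += arr[i];   // int added into long long accumulator
        return sm;
    }
protected:
    int* arr;
    int sz;
    long long sm;   // changed type; everything else as in the question
};
```

If the runtime of this version matches the free function, aliasing is the likely culprit; if it stays slow, something else is going on.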

Paul R
I think the problem may be with access to `arr` too, not only `sm`. Besides, why does `g++` optimize it then? Does it take an unwarranted risk with such an optimization?
Eli Bendersky
@Eli: you may be right, it may just be MSVC being dumb - I just wanted to point out that there is a potential aliasing issue with the code. If you have the time and the inclination it might be worth testing each factor separately, i.e. just cache `sm`, just cache `arr`, just cache `sz`, then the various combinations. It would be instructive to know which of these make(s) a difference.
Paul R
@Paul: see my answer update. `sm` is indeed the main culprit. Still, I don't understand which compiler is wrong here - `g++` or `msvc`
Eli Bendersky
@Eli: it *may* depend on each compiler's default attitude towards aliasing - try `gcc -fno-strict-aliasing` and see if that drags gcc performance down to that of MSVC. Or it may just be that gcc is smarter, of course. ;-)
Paul R
@Eli: You've not posted the assembly code, although it could help. I guess you've not even looked at it. "Why does g++ optimize it" - says who? Maybe g++ doesn't optimize either function, and that's why the performance is the same.
ybungalobill
+1  A: 

As Paul wrote, it is probably because the sm member is actually updated in "real" memory on every iteration, whereas the local sum in the function can be accumulated in a register (after compiler optimization).

Gregory
A: 

You can get similar issues when passing in pointer arguments. If you like getting your hands dirty, you may find the restrict keyword useful in future.

http://developers.sun.com/solaris/articles/cc_restrict.html
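As a sketch, assuming the compiler accepts the `__restrict` extension (MSVC and GCC/Clang both spell it this way, though it is not standard C++): the qualifier promises that `arr` is not aliased by any other pointer in the function, which licenses the register-caching optimization.

```cpp
// Sketch only: __restrict is a compiler extension, not standard C++.
// It asserts that during this function's lifetime, the array behind arr
// is accessed through arr alone, so the compiler need not re-load or
// re-store values defensively.
int arradd_restrict(int* __restrict arr, int sz)
{
    int sm = 0;                 // local anyway, but the promise also lets
    for (int n = 0; n < 1000; ++n)  // the compiler hoist loads of arr[i]
        for (int i = 0; i < sz; ++i)
            sm += arr[i];
    return sm;
}
```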

spraff
C++ does not have restrict.
DeadMG
Yes, it does: `__restrict__`
spraff
Or, for Microsoft-land: `__declspec(restrict)`
spraff
@spraff: C++ is what the standard defines, not your specific compiler extension. So DeadMG's comment is right.
ybungalobill
@spraff: and those were part of the C++ standard since when?
jalf
Just because something is a compiler extension doesn't mean you can't use it, and just because the original question is in C++ doesn't mean the answer can't be in C.
spraff
Relax, guys, this comment is only supposed to be tangential in the first place! The OP is not even talking about arguments passed in, I'm just trying to widen the answer a bit.
spraff
A: 
kriss
Your "other thread" remark sounds fishy, sorry. Wouldn't it mean all optimizations have to be off to account for un-thread-safe code accessing "in-progress" modified members from different threads?
Eli Bendersky
You are probably right. From another thread's point of view it's impossible to know whether doadd was interrupted at the beginning, in the middle, or at the end. So taking care of that when compiling doadd is certainly irrelevant. I shall remove the part about other threads from my answer. What we are left with is that there is no easy way to be sure that some class derived from tester won't make `arr` point at `sm`.
kriss
+3  A: 

MSVC is correct, in that it is the only one that, given the code we've seen, is guaranteed to work correctly. GCC employs optimizations that are probably safe in this specific instance, but that can only be verified by seeing more of the program.

Because sm is not a local variable, MSVC apparently assumes that it might alias arr. That's a fairly reasonable assumption: because arr is protected, a derived class might set it to point to sm, so arr could alias sm.

GCC sees that it doesn't actually alias arr, and so it doesn't write sm back to memory until after the loop, which is much faster.

It's certainly possible to instantiate the class so that arr points to sm; MSVC's code would handle that correctly, but GCC's wouldn't.
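To make the concern concrete, here is a hypothetical subclass (not from the question) that legally makes arr point at sm; the tester class is reproduced from the question so the sketch is self-contained:

```cpp
// tester as defined in the question.
class tester
{
public:
    tester(int* arr_, int sz_) : arr(arr_), sz(sz_) {}
    int doadd()
    {
        sm = 0;
        for (int n = 0; n < 1000; ++n)
            for (int i = 0; i < sz; ++i)
                sm += arr[i];
        return sm;
    }
protected:
    int* arr;
    int sz;
    int sm;
};

// Hypothetical subclass: because arr is protected, derived code can
// legally alias it to sm. With sz == 1 this is well-defined, and a
// register-cached sm would give the wrong answer.
class aliased : public tester
{
public:
    aliased() : tester(0, 1)
    {
        arr = &sm;   // arr[0] and sm now name the same int
    }
};
```

With this setup, every `sm += arr[i]` reads sm's own (just-zeroed) value, so doadd returns 0 - but only if the compiler faithfully stores sm to memory each iteration.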

Assuming that sz > 1, GCC's optimization is permissible in general.

Because the function loops over arr, treating it as an array of sz elements, calling it with sz > 1 while arr aliased sm would yield undefined behavior, so GCC can safely assume they don't alias. But if sz == 1, or if the compiler can't be sure of sz's value, it runs the risk that sz might be 1; in that case arr and sm could alias perfectly legally, and GCC's code would break.

So most likely, GCC simply gets away with it by inlining the whole thing, and seeing that in this case, they don't alias.

jalf
"GCC's behavior is still allowed because arr is treated as an array of 1000 elements inside the loop" it's not. It's treated as an array of `mysz` elements.
ybungalobill
ah yeah, you're right. I guess I should've read the code more carefully. Will edit my post then :)
jalf
aaaand fixed. :)
jalf
Man, reading over my answer now, it's a mess. Guess that's what I get for so many extensive edits. Oh well... At least I'm pretty sure it's correct now. :)
jalf
why would "calling the function with `sz > 1` would yield undefined behavior" ?
Eli Bendersky
@jalf: hmmm, I think you're wrong again. Even for sz > 1, if you derive from tester and add additional fields, it can be implementation-defined behavior. Not sure though.
ybungalobill
@ybungalobill: how do you mean? If `sz > 1`, then there are two cases to consider: either `arr` aliases `sm`, or it doesn't. If it does, then because `sz > 1`, it'll read past `sm` and yield UB; and if they don't alias, then no matter what other changes are made to the class, it's safe to assume that they don't alias.
jalf
Although none of this is really important. It was relevant 4 edits ago, until you made your first comment. It's pretty clear that what really happens is just that GCC inlines everything, and sees that no aliasing occurs, so it's safe to keep `sm` in a register.
jalf
@jalf: reading past sm isn't necessarily undefined behavior. It's undefined if it's outside the most derived object: `struct A : tester { int arr[1000]; };` makes it implementation defined. Although I don't know the exact standard formulation for POD/nonPOD types.
ybungalobill
A: 

The key is probably how doadd looks once you make the member accesses explicit via the this pointer:

int doadd()
{
    this->sm = 0;

    for (int n = 0; n < 1000; ++n) 
    {
        for (int i = 0; i < this->sz; ++i)
        {
            this->sm += this->arr[i];
        }
    }

    return this->sm;
}

Therein lies the problem: all class members are accessed via the this pointer, whereas arradd has all its variables on the stack. You found that by moving all the members into local variables on the stack, the speed matches arradd. So this indicates that the indirection through this is responsible for the performance loss.

Why might that be? As I understand it, this is usually stored in a register, so I don't think it's ultimately any slower than accessing the stack (which is an offset from the stack pointer as well). As other answers point out, it's probably the aliasing problem that generates less optimal code: the compiler can't tell whether any of the memory addresses overlap. Updating sm could in theory change the contents of arr, so it decides to write the value of sm out to main memory every time rather than tracking it in a register. When variables are on the stack, the compiler can assume they're all at different memory addresses. The compiler doesn't see the program as clearly as you do: it can tell what's on the stack (because you declared it like that), but everything else is just arbitrary memory addresses that could be anything, anywhere, overlapping any other pointer.

I'm not surprised the optimisation in your question (using local variables) isn't made - not only would the compiler have to prove the memory of arr does not overlap anything pointed to by this, but also that not updating the member variables until the end of the function is equivalent to the un-optimised version updating throughout the function. That can be a lot trickier to determine than you imagine, especially if you take concurrency into account.

AshleysBrain
A: 

I disassembled the code with MSVC to better understand what's going on. Turns out aliasing wasn't a problem at all, and neither was some kind of paranoid thread safety.

Here is the interesting part of the arradd function disassembled:

    for (int n = 0; n < 10; ++n)
    {
        for (int i = 0; i < sz; ++i)
013C101C  mov         ecx,ebp 
013C101E  mov         ebx,29B9270h 
        {
            sm += arr[i];
013C1023  add         eax,dword ptr [ecx-8] 
013C1026  add         edx,dword ptr [ecx-4] 
013C1029  add         esi,dword ptr [ecx] 
013C102B  add         edi,dword ptr [ecx+4] 
013C102E  add         ecx,10h 
013C1031  sub         ebx,1 
013C1034  jne         arradd+23h (13C1023h) 
013C1036  add         edi,esi 
013C1038  add         edi,edx 
013C103A  add         eax,edi 
013C103C  sub         dword ptr [esp+10h],1 
013C1041  jne         arradd+16h (13C1016h) 
013C1043  pop         edi  
013C1044  pop         esi  
013C1045  pop         ebp  
013C1046  pop         ebx  

ecx points to the array, and we can see that the inner loop is unrolled 4x here: note the four consecutive add instructions reading from adjacent addresses, and ecx being advanced by 16 bytes (4 words) at a time inside the loop.
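For illustration, the unrolled inner loop corresponds roughly to this C++ sketch (hypothetical; it assumes sz is a multiple of 4, whereas the compiler-generated code handles the general case with a remainder):

```cpp
// Rough C++ equivalent of the 4x-unrolled assembly above: four
// independent accumulators (like eax/edx/esi/edi) break the dependency
// chain, and the index advances by 4 per iteration (like ecx by 16 bytes).
int arradd_unrolled(const int* arr, int sz)
{
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int n = 0; n < 10; ++n)          // outer count as in the listing
    {
        for (int i = 0; i < sz; i += 4)   // assumes sz % 4 == 0
        {
            s0 += arr[i];
            s1 += arr[i + 1];
            s2 += arr[i + 2];
            s3 += arr[i + 3];
        }
    }
    return s0 + s1 + s2 + s3;             // the adds after the jne
}
```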

For the unoptimized version of the member function, doadd:

int tester::doadd()
{
    sm = 0;
    for (int n = 0; n < 10; ++n)
    {
        for (int i = 0; i < sz; ++i)
        {
            sm += arr[i];
        }
    }
    return sm;
}

The disassembly is (it's harder to find since the compiler inlined it into main):

    int tr_result = tr.doadd();
013C114A  xor         edi,edi 
013C114C  lea         ecx,[edi+0Ah] 
013C114F  nop              
013C1150  xor         eax,eax 
013C1152  add         edi,dword ptr [esi+eax*4] 
013C1155  inc         eax  
013C1156  cmp         eax,0A6E49C0h 
013C115B  jl          main+102h (13C1152h) 
013C115D  sub         ecx,1 
013C1160  jne         main+100h (13C1150h) 

Note two things:

  • The sum is stored in a register, edi. Hence, no aliasing "care" is taken here; the value of sm isn't re-read all the time. edi is initialized just once and then used as an accumulator. You don't see the store back to sm because the compiler optimized it away and used edi directly as the return value of the inlined code.
  • The loop is not unrolled. Why? No good reason.

Finally, here's an "optimized" version of the member function, with mysm keeping the sum local manually:

int tester::doadd_opt()
{
    sm = 0;
    int mysm = 0;
    for (int n = 0; n < 10; ++n)
    {
        for (int i = 0; i < sz; ++i)
        {
            mysm += arr[i];
        }
    }
    sm = mysm;
    return sm;
}

The (again, inlined) disassembly is:

    int tr_result_opt = tr_opt.doadd_opt();
013C11F6  xor         edi,edi 
013C11F8  lea         ebp,[edi+0Ah] 
013C11FB  jmp         main+1B0h (13C1200h) 
013C11FD  lea         ecx,[ecx] 
013C1200  xor         ecx,ecx 
013C1202  xor         edx,edx 
013C1204  xor         eax,eax 
013C1206  add         ecx,dword ptr [esi+eax*4] 
013C1209  add         edx,dword ptr [esi+eax*4+4] 
013C120D  add         eax,2 
013C1210  cmp         eax,0A6E49BFh 
013C1215  jl          main+1B6h (13C1206h) 
013C1217  cmp         eax,0A6E49C0h 
013C121C  jge         main+1D1h (13C1221h) 
013C121E  add         edi,dword ptr [esi+eax*4] 
013C1221  add         ecx,edx 
013C1223  add         edi,ecx 
013C1225  sub         ebp,1 
013C1228  jne         main+1B0h (13C1200h) 

The loop here is unrolled, but only 2x.

This explains my speed-difference observations quite well. For a 175e6-element array, the function runs in ~1.2 secs, the unoptimized member in ~1.5 secs, and the optimized member in ~1.3 secs. (Note that this may differ for you; on another machine I got closer runtimes for all 3 versions.)

What about gcc? When compiled with it, all 3 versions ran at ~1.5 secs. Suspecting a lack of unrolling, I looked at gcc's disassembly, and indeed: gcc doesn't unroll any of the versions.

Eli Bendersky