views:

285

answers:

6

Why is returning a std::pair or boost::tuple so much less efficient than returning by reference? In real codes that I've tested, setting data by non-const reference rather than by std::pair in an inner kernel can speed up the code by 20%.

As an experiment, I looked at three simplest-case scenarios involving adding two (predefined) integers to two integers:

  1. Use an inner, inlined function to modify the integers by reference

  2. Use two inner, inlined function to return ints by value

  3. Use an inner, inlined function to return a std::pair which are copied to the result.

Compiling with g++ -c $x -Wall -Wextra -O2 -S results in the same assembly code for passing by reference and returning ints by value:

__Z7getPairiRiS_:
LFB19:
    pushq %rbp
LCFI0:
    leal 1023(%rdi), %eax
    addl $31, %edi
    movl %eax, (%rsi)
    movq %rsp, %rbp
LCFI1:
    movl %edi, (%rdx)
    leave
    ret

(Pass by reference code:

#include <utility>

inline void myGetPair(const int inp, int& a, int& b) {
    a = 1023 + inp;
    b = 31 + inp;
}

void getPair(const int inp, int& a, int& b) {
    myGetPair(inp, a, b);
}

Using individual rvalues:

#include <utility>

inline int myGetPair1(int inp) {
    return 1023 + inp;
}

inline int myGetPair2(int inp) {
    return 31 + inp;
}

void getPair(const int inp, int& a, int& b) {
    a = myGetPair1(inp);
    b = myGetPair2(inp);
}

)

Using std::pair, however, adds five extra assembly statements:

__Z7getPairiRiS_:
LFB18:
    leal 31(%rdi), %eax
    addl $1023, %edi
    pushq %rbp
LCFI0:
    salq $32, %rax
    movq %rsp, %rbp
LCFI1:
    orq %rdi, %rax
    movq %rax, %rcx
    movl %eax, (%rsi)
    shrq $32, %rcx
    movl %ecx, (%rdx)
    leave
    ret

The code for that is nearly as simple as the previous examples:

#include <utility>

inline std::pair<int,int> myGetPair(int inp) {
    return std::make_pair(1023 + inp, 31 + inp);
}

void getPair(const int inp, int& a, int& b) {
    std::pair<int,int> result = myGetPair(inp);

    a = result.first;
    b = result.second;
}

Can anyone who knows the inner workings of compilers help with this question? The boost tuple page makes reference to a performance penalty for tuples vs. pass-by-reference, but none of the linked papers answer the question.

The reason I'd prefer std::pair to these pass-by-reference statements is that it makes the intent of the function much clearer in many circumstances, especially when other parameters are input as well as the ones that are to be modified.

+3  A: 

Returning by value forces the construction of unnamed temporaries on the stack and the assignment of their values to your local variables.

William Bell
Not true. The C++ Standard says that such temporaries may or may not be created.
anon
+3  A: 

Most C++ programmers value clarity of expression far above some dubious "efficiency" issues. But if you must know, returning a reference will almost always be implemented by the compiler by returning a pointer.

anon
My concern wasn't with returning references but with passing as a reference in place. And there is nothing dubious about a reproducible, nontrivial speed increase when returning a `std::pair` on my workhorse compiler and machine.
Seth Johnson
+5  A: 

Bad optimization. Let's hope that compilers will be better someday.

Alexey Malistov
+3  A: 

This is mostly guessing, but I think that one reason the compiler has a difficulty optimizing std::pair is that pair has a non-trivial constructor. This makes it non-POD, and the compiler can't make too aggressive assumptions when optimizing.

The POD-ness rules are changing in C++0x, though, so maybe that'll help things.

CAdaker
+3  A: 

What compiler are you using? On gcc 4.4.2 I get exactly the same result (opcode for opcode) for the pass-by-reference and the return-pairs-by-value. Namely:

mov 0x4(%esp),%eax
mov 0x8(%esp),%edx
lea 0x3ff(%eax),%ecx
add $0x1f,%eax
mov %ecx,(%edx)
mov 0xc(%esp),%edx
mov %eax,(%edx)
ret
lea 0x0(%esi),%esi

This is with -O2 -fomit-frame-pointer. It looks like the compiler understands the intent here and hasn't done anything redundant. Maybe you should upgrade yours :P

int3
+4  A: 

I tried that with VC++2008, using cl.exe /c /O2 /FAs foo.cpp (that's "compile only and do not link", "optimize for speed", and "dump assembly output with matching source code lines in comments"). Here's what getLine() ended up being.

"byref" version:

PUBLIC  ?getPair@@YAXHAAH0@Z                ; getPair
; Function compile flags: /Ogtpy
;   COMDAT ?getPair@@YAXHAAH0@Z
_TEXT   SEGMENT
_inp$ = 8                       ; size = 4
_a$ = 12                        ; size = 4
_b$ = 16                        ; size = 4
?getPair@@YAXHAAH0@Z PROC               ; getPair, COMDAT

; 9    :     myGetPair(inp, a, b);

    mov eax, DWORD PTR _inp$[esp-4]
    mov edx, DWORD PTR _a$[esp-4]
    lea ecx, DWORD PTR [eax+1023]
    mov DWORD PTR [edx], ecx
    mov ecx, DWORD PTR _b$[esp-4]
    add eax, 31                 ; 0000001fH
    mov DWORD PTR [ecx], eax

; 10   : }

    ret 0
?getPair@@YAXHAAH0@Z ENDP               ; getPair

"byval" std::pair-returning version:

PUBLIC  ?getPair@@YAXHAAH0@Z                ; getPair
; Function compile flags: /Ogtpy
;   COMDAT ?getPair@@YAXHAAH0@Z
_TEXT   SEGMENT
_inp$ = 8                       ; size = 4
_a$ = 12                        ; size = 4
_b$ = 16                        ; size = 4
?getPair@@YAXHAAH0@Z PROC               ; getPair, COMDAT

; 8    :     std::pair<int,int> result = myGetPair(inp);

    mov eax, DWORD PTR _inp$[esp-4]

; 9    : 
; 10   :     a = result.first;

    mov edx, DWORD PTR _a$[esp-4]
    lea ecx, DWORD PTR [eax+1023]
    mov DWORD PTR [edx], ecx

; 11   :     b = result.second;

    mov ecx, DWORD PTR _b$[esp-4]
    add eax, 31                 ; 0000001fH
    mov DWORD PTR [ecx], eax

; 12   : }

    ret 0
?getPair@@YAXHAAH0@Z ENDP               ; getPair

As you can see, the actual assembly is identical; the only difference is in mangled names and comments.

Pavel Minaev
Interesting; so it's therefore just an extra optimization that gcc doesn't do.
Seth Johnson