I've recently been writing some code for a research project that I'm working on, where efficiency is very important. I've been considering scrapping some of the regular ways I do things and using bitwise XORs instead. What I'm wondering is whether this will make a difference (if I'm performing this operation, say, several million times) or whether it's the same after I use -O3 in g++.

The two examples that come to mind:

I had an instance where (I'm working with purely positive ints) I needed to change n to n-1 if n was odd, or to n+1 if n was even. I figured I had a few options:

if (n % 2) // or (n % 2 == 0) and flip the order
    n = n - 1;
else
    n = n + 1;

or

n = n - 2*(n % 2) + 1; // This of course was silly, but was the only non-bitwise one-liner I could come up with

Finally:

n = n ^ 1;

All of the methods clearly do the same thing, but my feeling was that the third one would be the most efficient.

The next example is on a more general note. Say I'm comparing two positive integers; will one of these perform better than the others, or will the difference really not be noticeable, even if I perform this operation several million times:

if (n_1 == n_2)
if (!(n_1 ^ n_2))
if (n_1 ^ n_2) {} else // do work here

Will the compiler just do the same operation in all of these instances? I'm just curious if there is an instance when I should use bitwise operations and not trust the compiler to do the work for me.

Fixed: incorrect statement of the problem.

+3  A: 

I needed to change n to n-1 if n was even or n to (n+1) if n was odd.

In that case, regardless of efficiency, n = n ^ 1 is wrong.

For your second case, == will be just as efficient as (if not more so than) any of the others.


In general, when it comes to optimization, you should benchmark it yourself. If a potential optimization is not worth benchmarking, it's not really worth making.
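If you do decide to benchmark, a bare-bones timing loop is enough to get started. A minimal sketch (the operation and the iteration count here are placeholders; volatile is just there to keep the compiler from deleting the loop):

    #include <time.h>
    #include <iostream>

    int main() {
        volatile unsigned n = 1;            // volatile so the loop isn't optimized away
        clock_t start = clock();
        for (int i = 0; i < 100000000; ++i)
            n ^= 1;                         // the operation under test
        clock_t stop = clock();
        std::cout << "clock ticks: " << stop - start << "\n";
        return 0;
    }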

Anon.
I wrote that up wrong, my fault (fixed it). Also, why will == be just as efficient, if not more so? Will it simply do the same operations that I'm writing and save overhead?
Jacob Schlather
`cmp` is a *single assembly instruction*. If you think anything involving lots of bitwise manipulation would be faster than that single instruction, you really shouldn't be trying to do low-level optimizations.
Anon.
A: 

A good compiler will optimize n%2, but you can always check the assembly produced to see. If you see divides, start optimizing it yourself, because divide is about as slow as it gets.
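As a quick illustration (the function names here are mine, just for the example): with unsigned operands, n % 2 is exactly n & 1, so a decent compiler should emit a single AND and no divide for either function. Compile with -O3 and inspect the assembly to confirm:

    #include <cassert>

    // With unsigned n, n % 2 and n & 1 compute the same value,
    // so the compiler can use a single AND instead of a divide.
    unsigned parity_mod(unsigned n) { return n % 2; }
    unsigned parity_and(unsigned n) { return n & 1; }

    int main() {
        for (unsigned n = 0; n < 1000; ++n)
            assert(parity_mod(n) == parity_and(n));
        return 0;
    }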

David Kanarek
A: 

You should trust your compiler. gcc/g++ is the product of years of development, and it's capable of doing any optimizations you're probably thinking of. And it's likely that if you start playing around, you'll tamper with its efforts to optimize your code.

Juan
+5  A: 

It's easy enough to check, just fire up your disassembler. Take a look:

f.c:

unsigned int f1(unsigned int n)
{
  n ^= 1;  
  return n;
}

unsigned int f2(unsigned int n)
{
  if (n % 2)
    n=n-1;
  else
    n=n+1;

  return n;
}

Build and disassemble:

$ cc -O3 -c f.c 
$ otool -tV f.o 
f.o:
(__TEXT,__text) section
_f1:
00  pushq   %rbp
01  movq    %rsp,%rbp
04  xorl    $0x01,%edi
07  movl    %edi,%eax
09  leave
0a  ret
0b  nopl    _f1(%rax,%rax)
_f2:
10  pushq   %rbp
11  movq    %rsp,%rbp
14  leal    0xff(%rdi),%eax
17  leal    0x01(%rdi),%edx
1a  andl    $0x01,%edi
1d  cmovel  %edx,%eax
20  leave
21  ret

It looks like f1() is a bit shorter, whether or not that matters in reality is up to some benchmarking.

Carl Norum
`f2` appears to be "subtract 1, conditionally add 2".
Anon.
It's not just that the instruction path for XOR is shorter. On *most* hardware, bitwise instructions are faster than integer arithmetic; also, many compilers will add code to deal with integer overflows etc. when adding or subtracting.
James Anderson
What hardware has faster bitwise instructions than integer arithmetic? On ARM, at least, they're all the same.
Carl Norum
The big win of shorter code is typically that it will fit into cache better.
David Thornley
+2  A: 

About the only way to know for sure is to test. I'd have to agree that it would take a fairly clever compiler to produce output as efficient for:

if (n % 2) // or (n % 2 == 0) and flip the order
    n = n - 1;
else
    n = n + 1;

as it could for n ^= 1;, but I haven't checked anything similar recently enough to say with any certainty.

As for your second question, I doubt it makes any difference -- an equality comparison is going to end up fast for any of these methods. If you want speed, the main thing to do is avoid having a branch involved at all -- e.g. something like:

if (a == b)
    c += d;

can be written as: c += d * (a==b);. Looking at the assembly language, the second will often look a bit messy (with ugly cruft to get the result of the comparison from the flags into a normal register) but still often perform better by avoiding any branches.

Edit: At least the compilers I have handy (gcc & MSVC) do not generate a cmov for the if, but they do generate a sete for the * (a==b). I expanded the code to something testable.

Edit2: Since Potatoswatter brought up another possibility, using bitwise AND instead of multiplication, I decided to test that along with the others. Here's the code with that added:

#include <time.h>
#include <iostream>
#include <stdlib.h>

int addif1(int a, int b, int c, int d) { 
    if (a==b) 
        c+=d;
    return c;
}

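// (a == b) is 0 or 1, so d * (a == b) is either 0 or d; no branch needed.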
int addif2(int a, int b, int c, int d) {
    return c += d * (a == b);
}

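// -(a == b) is 0 or -1 (all bits set), so d & -(a == b) is either 0 or d.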
int addif3(int a, int b, int c, int d) {
    return c += d & -(a == b);
}

int main() {
    const int iterations = 50000;
    int x = rand();
    unsigned tot1 = 0;
    unsigned tot2 = 0;
    unsigned tot3 = 0;

    clock_t start1 = clock();
    for (int i=0; i<iterations; i++) {
        for (int j=0; j<iterations; j++) 
            tot1 +=addif1(i, j, i, x);
    }
    clock_t stop1 = clock();

    clock_t start2 = clock();
    for (int i=0; i<iterations; i++) {
        for (int j=0; j<iterations; j++) 
            tot2 +=addif2(i, j, i, x);
    }
    clock_t stop2 = clock();

    clock_t start3 = clock();
    for (int i=0; i<iterations; i++) {
        for (int j=0; j<iterations; j++) 
            tot3 +=addif3(i, j, i, x);
    }
    clock_t stop3 = clock();

    std::cout << "Ignore: " << tot1 << "\n";
    std::cout << "Ignore: " << tot2 << "\n";
    std::cout << "Ignore: " << tot3 << "\n";

    std::cout << "addif1: " << stop1-start1 << "\n";
    std::cout << "addif2: " << stop2-start2 << "\n";
    std::cout << "addif3: " << stop3-start3 << "\n";    
    return 0;
}

Now the really interesting part: the results for the third version. For MS VC++, we get roughly what most of us would probably expect:

Ignore: 2682925904
Ignore: 2682925904
Ignore: 2682925904
addif1: 4814
addif2: 3504
addif3: 3021

Using the & instead of the * gives a definite improvement -- almost as much of an improvement as * gives over if. With gcc the result is quite a bit different though:

Ignore: 2680875904
Ignore: 2680875904
Ignore: 2680875904
addif1: 2901
addif2: 2886
addif3: 7675

In this case, the code using if is much closer to the speed of the code using *, but the code using & is slower than either one -- a lot slower! In case anybody cares, I found this surprising enough that I re-compiled a couple of times with different flags, re-ran a few times with each, and so on, and the result was entirely consistent: the code using & was considerably slower every time.

The poor result with the third version of the code compiled with gcc gets us back to what I said to start with [and ends this edit]:

As I said to start with, "the only way to know for sure is to test" -- but at least in this limited testing, the multiplication consistently beats the if. There may be some combination of compiler, compiler flags, CPU, data pattern, iteration count, etc., that favors the if over the multiplication -- there's no question that the difference is small enough that a test going the other direction is entirely believable. Nonetheless, I believe that it's a technique worth knowing; for mainstream compilers and CPUs, it seems reasonably effective (though it's certainly more helpful with MSVC than with gcc).

[resumption of edit2:] the result with gcc using & demonstrates the degree to which 1) micro-optimizations can be/are compiler specific, and 2) how much different real-life results can be from expectations.

Jerry Coffin
`cmov` is great for the compiler to optimize that sort of stuff. The generated assembly would likely be adding c and d, followed by a conditional move into c if the condition is true. Possibly even faster on modern processors (if they can parallelize the add and the compare) than the multiply-by-the-flag version (which is strictly serial and uses `mult`).
Anon.
Besides `cmov`, predictable branches are faster than multiplication. Multiplying by a Boolean is the kind of thing to test on real-world data after everything else is finished and you have absolutely nothing better to do.
Potatoswatter
Oh, also, depending on the latency of mult, `c += d & -(a==b)` could be faster.
Potatoswatter
@Potatoswatter: That's probably the fastest, as long as you can live with slightly non-portable code (mostly theoretical -- it just won't work on 1s complement or sign/magnitude hardware, both of which are now pretty rare, at least for integers).
Jerry Coffin
`c += d & -(a==b)`, but I'm sticking with "the branch is probably predictable and therefore free."
Potatoswatter
@Potatoswatter:You can stick to that if you want, but if you look at the test code above, you'll see proof that it's wrong. With the cast to unsigned, the code should work even on 1s complement or sign/magnitude hardware, but may not give a real advantage (but as I said before, such hardware is rare enough that few people have any reason to care).
Jerry Coffin
Your code proves nothing about OP's code. As I said, "predictable branches are faster than multiplication" and "test on real-world data" because that is the only way to know whether branches are predicted well or not. When I'm presented with a problematic branch, I try to improve predictability rather than find an algebraic substitution. Correctly predicted branches are cheap as free, as in zero cycles on most machines.
Potatoswatter
@Potatoswatter:to make a long story short, you're just plain wrong. See (for example) http://www.intel.com/Assets/PDF/manual/248966.pdf, §3.4.1.1 and http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF, §6.3. Oh, but wait: those are only written by Intel and AMD. Obviously **you** know more about how processors work than they ever could!
Jerry Coffin
Hmm, "3.4.1.1 Eliminating BranchesEliminating branches improves performance because:• It reduces the possibility of mispredictions.• It reduces the number of required branch target buffer (BTB) entries. Conditional branches, which are never taken, do not consume BTB resources." "6.3 Branches That Depend on Random Data OptimizationAvoid conditional branches that depend on random data, as these branches are difficult to predict." What did I say about prediction? You are on a roll, dude. Refuted by the first sentence of BOTH references you gave.
Potatoswatter
@PotatoSwatter: you claimed: "...correctly predicted branches are cheap as free..." Intel says: "...every branch counts. Even correctly predicted branches have a negative effect..." You argue with me regularly, and you lose every time. Let your intelligence win over your pig-headedness, and learn from this.
Jerry Coffin
Cheap as free does not mean absolutely free, it's an expression. You clipped the sentence: "Even correctly predicted branches have a negative effect on the amount of useful code delivered to the processor." As long as branch instructions aren't numerous enough to starve the processor of other instructions, they are quite cheap and unlike eg a multiplication do not introduce a dependency. This isn't obscure knowledge.
Potatoswatter
First argument: you said that numeric casts produce undefined behavior and are unsafe while I said that behavior is defined not by C++ but by the hardware spec. You could not provide an example supporting your claim. I would say you were wrong on that one. Second argument: you made an empty claim that "my code was wrong" and your very specialized method of comparing FP numbers is "right", but again didn't provide an example or even a substantive claim. Again I think you looked foolish. Now, you claim that a chain of three integer operations incl a mult is cheaper than a predicted branch + add.
Potatoswatter
@Potatoswatter: So the bottom line is that you won't admit you're wrong even when the standard says so, you won't admit you're wrong even when testing proves it, and basically, as far as you care, the simple fact that you hold an opinion makes it "right" no matter what facts prove otherwise. You also forgot to mention your completely idiotic claim that a binary search is really a binary tree.
Jerry Coffin
Well, I suppose someone reading our arguments could come to one of two conclusions. I think you have curious ways of interpreting things and particular habits of "copy-paste errors" and declaring yourself the victor rather than present an argument. Also, I just ran your above test on my machine (Core2, GCC 4.2 -O3, OS X) and `addif1` came out ahead 549107 to 685495, or 20% faster. (I *did* add a call to `srand`.) The real lesson here is to avoid premature optimization and to try to be scientific.
Potatoswatter
+1  A: 

Is `n ^= 1` faster than `if (n % 2) --n; else ++n;`? Yes. I wouldn't expect a compiler to optimize that. Since the bitwise operation is so much more succinct, it might be worthwhile to familiarize yourself with XOR and maybe add a comment on that line of code.
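Something like this, for instance (just a sketch of the kind of comment I mean):

    n ^= 1;  // flip the low bit: odd n becomes n-1, even n becomes n+1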

If it's really critical to the functionality of your program, it could also be considered a portability issue: if you test on your compiler and it's fast, you would likely be in for a surprise when trying on another compiler. Usually this isn't an issue for algebraic optimizations.

Is `x ^ y` faster than `x == y`? No. Doing things in roundabout ways is generally not good.
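For what it's worth, the two forms compute the same thing, and you can check what each compiles to yourself; a quick sketch (the function names are mine):

    #include <cassert>

    bool eq1(unsigned a, unsigned b) { return a == b; }
    bool eq2(unsigned a, unsigned b) { return !(a ^ b); }

    int main() {
        assert(eq1(3, 3) && eq2(3, 3));
        assert(!eq1(3, 4) && !eq2(3, 4));
        return 0;
    }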

Potatoswatter
Why would checking for equality be faster than x^y? In order to check for equality, you have to do an X0R. There's absolutely no way to get around it. In machine code wouldn't the way to do it, be to do an X0R and then check if the zero flag has been thrown? As opposed to just doing the X0R and placing the result in a register, instead of tossing it.
Jacob Schlather
@liberalkid: In practice, everything simpler than multiplication is effectively the same speed because time is measured in cycles. The processor spends very little energy on performing simple operations compared to bookkeeping for parallel instruction execution. I never said either would be faster. Not all ISAs have an XOR operation that sets flags. On a register-starved architecture like x86, keeping an unneeded result could cause something else to get pushed onto the stack -- very bad. Also, XOR (short for exclusive or) is spelled with an oh, not a zero.
Potatoswatter
A: 

n ^= 1 and n1==n2 are probably the best you can do, but really, if you are after maximum efficiency, don't eyeball the code looking for little things like that.

Here's an example of how to really tune for performance.

Don't expect low level optimizations to help much until sampling has proven they are where you should focus.

Mike Dunlavey