411

7
+5  Q:

## C/C++ optimization: negate doubles fast

I need to negate very large number of doubles quickly. If bit_generator generates 0, then the sign must be changed. If bit_generator generates 1, then nothing happens. The loop is run many times over and bit_generator is extremely fast. On my platform case 2 is noticeably faster than case 1. Looks like my CPU doesn't like branching. Is there any faster and portable way to do it? What do you think about case 3?

``````// generates 0 and 1
int bit_generator();

// big vector (C++)
vector<double> v;

// case 1
for (size_t i=0; i<v.size(); ++i)
if (bit_generator()==0)
v[i] = -v[i];

// case 2
const int sign[] = {-1, 1};
for (size_t i=0; i<v.size(); ++i)
v[i] *= sign[bit_generator()];

// case 3
const double sign[] = {-1, 1};
for (size_t i=0; i<v.size(); ++i)
v[i] *= sign[bit_generator()];

// case 4 uses C-array
double a[N];
double number_generator(); // generates doubles
double z[2]; // used as buffer
for (size_t i=0; i<N; ++i) {
z[0] = number_generator();
z[1] = -z[0];
a[i] = z[bit_generator()];
}
``````

EDIT: Added case 4 and C-tag, because the vector can be a plain array. Since I can control how doubles are generated, I redesigned the code as shown in case 4. It avoids extra multiplication and branching at the same. I presume it should be quite fast on all platforms.

+5  A:

if you can assume that the sign is represented by one specific bit, like in x86 implementations, you can simply do:

``````v[i] ^= !bit_generator() << SIGN_BIT_POSITION; // negate the output of
// bit_generator because 0 means
// negate and one means leave
// unchanged.
``````

In x86 the sign bit is the MSB, so for doubles that's bit 63:

``````#define SIGN_BIT_POSITION 63
``````

will do the trick.

Edit:

Based on comments, I should add that you might need to do some extra work to get this to compile, since `v` is an array of `double`, while `bit_generator()` returns `int`. You could do it like this:

``````union int_double {
double d;        // assumption: double is 64 bits wide
long long int i; // assumption: long long is 64 bits wide
};
``````

(syntax might be a bit different for C because you might need a typedef.)

Then define `v` as a vector of `int_double` and use:

``````v[i].i ^= bit_generator() << SIGN_BIT_POSITION;
``````
I need 100% portable solution.
+1 correct, but to amplify this is an processor architecture and compiler dependent optimization, the OP should clearly mark it as such in the code probably with appropriate reference to IEEE 754 - http://en.wikipedia.org/wiki/IEEE_754-2008
I don't think that will even compile if v[] is an array of doubles.
This is true for `float` and `double`, but not `int`. For `int`, you need to invert all the bits and then add 1 for two's complement. @m141: It's portable only for `float` and `double` if the platform uses IEEE 754 floating point (I don't know of one that doesn't these days).
@m141 - fast or portable; pick either one
@EJP: It can if you use the needed typecasts.
@Mike the OP explicitly specified `double`. For `int` you'd select a mask of all ones or all zeros and `xor` with that, then select either 0 or 1 and add that (or just add the output of `bit_generator()`)
There is no need to discuss this, because any modern compiler already **does the XOR neg if it is possible**. Doing this explicitly in code is really a not a good idea anymore (21th century).
@Luther You mean 21st ;)
@Luther, the problem is that if you don't do it explicitly, how will the compiler know? That is, unless you want to use a branch, which the OP explicitly doesn't.
@Nathan: Good point. But I am not still not sure if the method saves anything. On most architectures, the FP and Integer units run on different internal paths. Doing the XOR explicitly means that I have to move the FP number into the Integer Unit, XOR it and move it back. This is costly (and proably the reason why SSE2 has bit instructions for packed doubles and floats). Do you mind if I amend your answer with the assembly produced by gcc?
@Luther, considering that `v` will most likely be in memory to begin with, by specifying the integer flavor of the union, I'd expect the compiler to generate only integer instructions for this operation. However, please feel free to add the assembly that gcc produces.
Isn't writing to one union member and then reading from another at a later point in time undefined behavior? Doesn't sound very portable to me :)
@Fred: good point.. I'm not sure.
@Mike DeSimone, in other words it won't compile as shown.
@EJP: Yes, but I wanted to be helpful by pointing out why.
+3  A:

Generally, if you have an `if()` inside a loop, that loop cannot be vectorized or unrolled, and the code has to execute once per pass, maximizing the loop overhead. Case 3 should perform very well, especially if the compiler can use SSE instructions.

For fun, if you're using GCC, use the `-S -o foo.S -c foo.c` flags instead of the usual `-o foo.o -c foo.c` flags. This will give you the assembly code, and you can see what is getting compiled for your three cases.

+2  A:

You don't need the lookup table, a simple formula suffices:

``````const size_t size = v.size();
for (size_t i=0; i<size; ++i)
v[i] *= 2*bit_generator() - 1;
``````
The problem with this is that multiplication is generally slow, especially for doubles. On the plus side, this *is* portable.
@Mark B: I'm not sure about doubles, but multiplication of integers by 2 is simple shift (`bit_generator()` returns an `int`)
@Mark: not if the double in question is denormalized.
@Mark: I don't know what you mean by "double multiplicaton". `2*x-1` simply maps 0 to -1 and 1 to 1.
Yes, it is just as fast as case 2.
+9  A:

Unless you want to resize the vector in the loop, lift the v.size() out of the for expression, i.e.

``````const unsigned SZ=v.size();
for (size_t i=0; i<SZ; ++i)
if (bit_generator()==0)
v[i] = -v[i];
``````

If the compiler can't see what happens in bit_generator(), then it might be very hard for the compiler to prove that v.size() does not change, which makes loop unrolling or vectorization impossible.

UPDATE: I've made some tests and on my machine method 2 seems to be fastest. However, it seems to be faster to use a pattern which I call "group action" :-). Basically, you group multiple decisions into one value and switch over it:

``````const size_t SZ=v.size();
for (size_t i=0; i<SZ; i+=2) // manual loop unrolling
{
int val=2*bit_generator()+bit_generator();
switch(val) // only one conditional
{
case 0:
break; // nothing happes
case 1:
v[i+1]=-v[i+1];
break;
case 2:
v[i]=-v[i];
break;
case 3:
v[i]=-v[i];
v[i+1]=-v[i+1];
}
}
// not shown: wrap up the loop if SZ%2==1
``````
I was still preparing more material.
Nice answer, but using int for SZ is not portable. v.size() is unsigned, with size large enough to hold a memory address. I don't know of any platform where that is the same as int.
I made that size_t. I usually avoid unsigned in loop indices, because of 'x-1 > x' problems.
In `2*f()+g();`, the order of evaluation of `f()` and `g()` is unspecified. So, you can't be sure that `2*bit_generator()+bit_generator();` is 1 when `v[i+1]` needs to be negated, and 2 when `v[i]` needs to be negated. A better way would be `val = 2*bit_generator(); val += bit_generator();`
Sure. I assumed this was some kind of random generator in which case the evaluation order didn't matter.
Yes, bits are random. The order of evaluation would matter if the bits aren't equally probable.
A:

Are you able to rewrite `bit_generator` so it returns 1 and -1 instead? That removes an indirection from the equation at the possible cost of some clarity.

+1  A:

assuming that the actual negation is fast (a good assumption on a modern compiler and CPU), you could use a conditional assignment, which is also fast on modern CPUs, to choose between two possibilities:

``````v[i] = bit_generator() ? v[i] : -v[i];
``````

This avoids branches and allows the compiler to vectorize the loop and make it faster.

How does this avoid generating a branch?
Because SSE allows you to use masks to select one of two sources in one instruction without using a branch.
I tried it. It is the same as case 1 on my platform. Maybe I need to enable SSE?
you should try it
A:

## Premature optimization is the root of insipid SO questions

On my machine, running at 5333.24 BogoMIPS, the timings for 1'000 iterations across an array of 1'000'000 doubles yields the following times per expression:

``````p->d = -p->d         7.33 ns
p->MSB(d) ^= 0x80    6.94 ns
``````

Where MSB(d) is pseudo-code for grabbing the most significant byte of `d`. This means that the naive `d = -d` takes 5.32% longer to execute than the obfuscated approach. For a billion such negations this means the difference between 7.3 and 6.9 seconds.

Someone must have an awfully big pile of doubles to care about that optimization.

Incidentally, I had to print out the content of the array when completed or my compiler optimized the whole test into zero op codes.

More like 100M doubles.
Last I checked, 1'000'000'000 was greater than 100'000'000 by approximately a factor of 10. Is your day so frantic you need to save 39 milliseconds?
It is more like 500ms per 100M doubles on my platform. It seems that the version which uses branching is more platform dependent (I mean speed) than the case 4.