views:

411

answers:

7

I need to negate a very large number of doubles quickly. If bit_generator generates 0, the sign must be changed; if it generates 1, nothing happens. The loop runs many times and bit_generator is extremely fast. On my platform, case 2 is noticeably faster than case 1. It looks like my CPU doesn't like branching. Is there a faster, portable way to do this? What do you think about case 3?

// generates 0 and 1
int bit_generator();

// big vector (C++)
vector<double> v;

// case 1
for (size_t i=0; i<v.size(); ++i)
    if (bit_generator()==0)
        v[i] = -v[i];

// case 2
const int sign[] = {-1, 1};
for (size_t i=0; i<v.size(); ++i)
        v[i] *= sign[bit_generator()];

// case 3
const double sign[] = {-1, 1};
for (size_t i=0; i<v.size(); ++i)
        v[i] *= sign[bit_generator()];

// case 4 uses C-array
double a[N];
double number_generator(); // generates doubles
double z[2]; // used as buffer
for (size_t i=0; i<N; ++i) {
        z[0] = number_generator();
        z[1] = -z[0];
        a[i] = z[bit_generator()];
}

EDIT: Added case 4 and the C tag, because the vector can be a plain array. Since I can control how the doubles are generated, I redesigned the code as shown in case 4. It avoids the extra multiplication and the branching at the same time. I presume it should be quite fast on all platforms.

+5  A: 

If you can assume that the sign is represented by one specific bit, as in x86 implementations, you can simply do:

v[i] ^= (unsigned long long)!bit_generator() << SIGN_BIT_POSITION;
                             // negate the output of bit_generator()
                             // because 0 means negate and 1 means
                             // leave unchanged; the cast avoids the
                             // undefined behavior of shifting a
                             // 32-bit int by 63

On x86 the sign bit is the MSB, so for doubles that's bit 63:

#define SIGN_BIT_POSITION 63 

will do the trick.

Edit:

Based on comments, I should add that you might need to do some extra work to get this to compile, since v is an array of double, while bit_generator() returns int. You could do it like this:

union int_double {
    double d;        // assumption: double is 64 bits wide
    long long int i; // assumption: long long is 64 bits wide
};

(syntax might be a bit different for C because you might need a typedef.)

Then define v as a vector of int_double and use:

v[i].i ^= (long long)!bit_generator() << SIGN_BIT_POSITION;
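
To sidestep questions about reading a union member other than the one last written, a minimal sketch of the same trick using std::memcpy (assuming IEEE 754 binary64 doubles and a 64-bit std::uint64_t; the helper name is mine, not from the answer):

```cpp
#include <cstdint>
#include <cstring>

// Flips the sign of d when flip is 1, leaves it unchanged when flip is 0.
// Assumes double is IEEE 754 binary64, so the sign is bit 63.
inline double flip_sign_if(double d, int flip) {
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);   // well-defined type punning
    bits ^= static_cast<std::uint64_t>(flip) << 63;
    std::memcpy(&d, &bits, sizeof bits);
    return d;
}
```

In the loop this would be `v[i] = flip_sign_if(v[i], !bit_generator());`, since 0 from bit_generator() means negate.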
Nathan Fellman
I need a 100% portable solution.
+1, correct; but to amplify, this is a processor-architecture- and compiler-dependent optimization, and the OP should clearly mark it as such in the code, probably with an appropriate reference to IEEE 754 - http://en.wikipedia.org/wiki/IEEE_754-2008
msw
I don't think that will even compile if v[] is an array of doubles.
EJP
This is true for `float` and `double`, but not `int`. For `int`, you need to invert all the bits and then add 1 for two's complement. @m141: It's portable only for `float` and `double` if the platform uses IEEE 754 floating point (I don't know of one that doesn't these days).
Mike DeSimone
@m141 - fast or portable; pick either one
msw
@EJP: It can if you use the needed typecasts.
Mike DeSimone
@Mike the OP explicitly specified `double`. For `int` you'd select a mask of all ones or all zeros and `xor` with that, then select either 0 or 1 and add that (or just add the output of `bit_generator()`)
Nathan Fellman
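
A sketch of the integer variant described in this comment (assuming two's-complement int; the helper name is mine):

```cpp
// Branchless conditional negation for two's-complement integers.
// -x equals (x ^ ~0) + 1, so XOR with an all-ones mask and add 1.
inline int negate_if(int x, int flip) {
    int mask = -flip;          // flip==1 -> all ones, flip==0 -> all zeros
    return (x ^ mask) + flip;  // flip doubles as the "+1" term
}
```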
There is no need to discuss this, because any modern compiler already **does the XOR neg if it is possible**. Doing this explicitly in code is really not a good idea anymore (21th century).
Luther Blissett
@Luther You mean 21st ;)
FredOverflow
@Luther, the problem is that if you don't do it explicitly, how will the compiler know? That is, unless you want to use a branch, which the OP explicitly doesn't.
Nathan Fellman
@Nathan: Good point. But I am still not sure if the method saves anything. On most architectures, the FP and integer units run on different internal paths. Doing the XOR explicitly means that I have to move the FP number into the integer unit, XOR it, and move it back. This is costly (and probably the reason why SSE2 has bit instructions for packed doubles and floats). Do you mind if I amend your answer with the assembly produced by gcc?
Luther Blissett
@Luther, considering that `v` will most likely be in memory to begin with, by specifying the integer flavor of the union, I'd expect the compiler to generate only integer instructions for this operation. However, please feel free to add the assembly that gcc produces.
Nathan Fellman
Isn't writing to one union member and then reading from another at a later point in time undefined behavior? Doesn't sound very portable to me :)
FredOverflow
@Fred: good point.. I'm not sure.
Nathan Fellman
@Mike DeSimone, in other words it won't compile as shown.
EJP
@EJP: Yes, but I wanted to be helpful by pointing out why.
Mike DeSimone
+3  A: 

Generally, if you have an if() inside a loop, that loop cannot be vectorized or unrolled, and the body has to execute once per pass, maximizing the loop overhead. Case 3 should perform very well, especially if the compiler can use SSE instructions.

For fun, if you're using GCC, use the -S -o foo.S -c foo.c flags instead of the usual -o foo.o -c foo.c flags. This will give you the assembly code, and you can see what gets compiled for each of your cases.

Mike DeSimone
+2  A: 

You don't need the lookup table, a simple formula suffices:

const size_t size = v.size();
for (size_t i=0; i<size; ++i)
    v[i] *= 2*bit_generator() - 1;
FredOverflow
The problem with this is that multiplication is generally slow, especially for doubles. On the plus side, this *is* portable.
Nathan Fellman
@Mark B: I'm not sure about doubles, but multiplication of integers by 2 is simple shift (`bit_generator()` returns an `int`)
jpalecek
@Mark: not if the double in question is denormalized.
Justin K
@Mark: I don't know what you mean by "double multiplication". `2*x-1` simply maps 0 to -1 and 1 to 1.
FredOverflow
Yes, it is just as fast as case 2.
+9  A: 

Unless you want to resize the vector in the loop, lift the v.size() out of the for expression, i.e.

const size_t SZ=v.size();
for (size_t i=0; i<SZ; ++i)
    if (bit_generator()==0)
        v[i] = -v[i];

If the compiler can't see what happens in bit_generator(), then it might be very hard for the compiler to prove that v.size() does not change, which makes loop unrolling or vectorization impossible.

UPDATE: I've made some tests and on my machine method 2 seems to be fastest. However, it seems to be faster to use a pattern which I call "group action" :-). Basically, you group multiple decisions into one value and switch over it:

const size_t SZ=v.size();
for (size_t i=0; i<SZ; i+=2) // manual loop unrolling
{
 int val=2*bit_generator()+bit_generator();
 switch(val) // only one conditional
 {
  case 0: 
     break; // nothing happens
  case 1: 
     v[i+1]=-v[i+1]; 
     break; 
  case 2: 
     v[i]=-v[i]; 
     break; 
  case 3: 
    v[i]=-v[i];
    v[i+1]=-v[i+1]; 
 }
}
// not shown: wrap up the loop if SZ%2==1 
Luther Blissett
this doesn't answer the question...
Nathan Fellman
I was still preparing more material.
Luther Blissett
Nice answer, but using int for SZ is not portable. v.size() is unsigned, with size large enough to hold a memory address. I don't know of any platform where that is the same as int.
Ryan
I made that size_t. I usually avoid unsigned in loop indices, because of 'x-1 > x' problems.
Luther Blissett
In `2*f()+g();`, the order of evaluation of `f()` and `g()` is unspecified. So, you can't be sure that `2*bit_generator()+bit_generator();` is 1 when `v[i+1]` needs to be negated, and 2 when `v[i]` needs to be negated. A better way would be `val = 2*bit_generator(); val += bit_generator();`
Alok
Sure. I assumed this was some kind of random generator in which case the evaluation order didn't matter.
Luther Blissett
Yes, bits are random. The order of evaluation would matter if the bits aren't equally probable.
A: 

Are you able to rewrite bit_generator so it returns 1 and -1 instead? That removes an indirection from the equation at the possible cost of some clarity.
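
For illustration, one way such a rewrite might look (the generator stub below is a stand-in, since the question does not show how bit_generator() is implemented; all names here are assumptions):

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the question's bit_generator(); any fast 0/1 source works.
inline int bit_generator() {
    static unsigned state = 12345u;
    state = state * 1103515245u + 12345u;  // tiny LCG, illustration only
    return (state >> 16) & 1;
}

// Hypothetical variant returning -1 or +1 directly, so the hot loop
// becomes a single branchless multiply with no table lookup.
inline int sign_generator() { return 2 * bit_generator() - 1; }

void negate_randomly(std::vector<double>& v) {
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] *= sign_generator();
}
```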

Mark B
+1  A: 

Assuming that the actual negation is fast (a good assumption on a modern compiler and CPU), you could use a conditional assignment, which is also fast on modern CPUs, to choose between the two possibilities:

v[i] = bit_generator() ? v[i] : -v[i];

This avoids branches and allows the compiler to vectorize the loop and make it faster.

Nathan Fellman
How does this avoid generating a branch?
Schedler
Because SSE allows you to use masks to select one of two sources in one instruction without using a branch.
Nathan Fellman
I tried it. It is the same as case 1 on my platform. Maybe I need to enable SSE?
you should try it
Nathan Fellman
A: 

Premature optimization is the root of insipid SO questions

On my machine, running at 5333.24 BogoMIPS, the timings for 1'000 iterations across an array of 1'000'000 doubles yields the following times per expression:

p->d = -p->d         7.33 ns
p->MSB(d) ^= 0x80    6.94 ns

Where MSB(d) is pseudo-code for grabbing the most significant byte of d. This means that the naive d = -d takes 5.32% longer to execute than the obfuscated approach. For a billion such negations this means the difference between 7.3 and 6.9 seconds.

Someone must have an awfully big pile of doubles to care about that optimization.

Incidentally, I had to print out the content of the array when completed or my compiler optimized the whole test into zero op codes.
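
The two expressions timed above could be sketched as functions like these (the byte index assumes little-endian IEEE 754 doubles, as on x86, and sizeof(double) == 8; timings will of course vary by machine):

```cpp
#include <cstddef>

// Variant 1: negate every element the obvious way.
void naive_negate(double* a, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        a[i] = -a[i];
}

// Variant 2: flip the sign bit by XOR-ing the most significant byte.
// Assumes little-endian IEEE 754 doubles, so the sign bit sits in byte 7.
void byte_xor_negate(double* a, std::size_t n) {
    unsigned char* p = reinterpret_cast<unsigned char*>(a);
    for (std::size_t i = 0; i < n; ++i)
        p[8 * i + 7] ^= 0x80;
}
```

As the answer notes, the results must be used (e.g. printed) after timing, or the compiler may optimize the loops away entirely.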

msw
More like 100M doubles.
Last I checked, 1'000'000'000 was greater than 100'000'000 by approximately a factor of 10. Is your day so frantic you need to save 39 milliseconds?
msw
It is more like 500 ms per 100M doubles on my platform. It seems that the speed of the branching version is more platform-dependent than that of case 4.