views: 331 · answers: 6

I need the most efficient way (in CPU cycles) to determine whether two numbers have the same or different signs. The catch: if either number is zero, I need to distinguish that case from the same-sign and different-sign cases (i.e. zero is treated as a "third" sign). The following code is similar to what I need, but the return values can be anything as long as there are only three distinct return values.

int foo(int x, int y) {
    if (x * y > 0) return 1;
    if (x * y < 0) return -1;
    return 0;
}

For my specific problem, the values are in the range [-6, 6] and x is guaranteed not to be 0. I found a solution for checking whether two numbers have the same sign, and altered it to get the following:

return y? (((x^y) >= 0)? 1 : -1) : 0;

There should be some combination of bit operations and comparisons that gives faster results than using multiplication and branching.

+7  A: 

Your example doesn't work because you didn't put parentheses around (x^y)

This is working:

return y? (((x^y) >= 0) ? 1 : -1) : 0;

I think you can't do much faster if you want to return exactly -1, 1 or 0. This is because -1 is 11111111 (all ones in two's complement) and is quite different from 0 and 1. A set of bit operations that would return 11111111, 0 or 1 would be complicated and certainly slower than the code above.

EDIT: if instead of -1 and 1 you can cope with any negative or positive number, then you can eliminate a branch:

return y ? ((x^y) | 1) : 0;
Tomaka17
The parentheses in your code aren't matched. Also, the code doesn't seem to return the correct result for x==0.
Martin B
OP said that x wasn't equal to 0, and my parentheses are matched
Tomaka17
With the second you get a negative value if the signs are different and a positive value if they are the same, but the exact value can vary; I proposed this since I don't really know the usage
Tomaka17
I don't necessarily need return values of -1,0,1 anything is fine as long as there are only three distinct return values.
Justin
"A set of bit operations that would return 11111111, 0 or 1 would be complicated and certainly slower than the code above." That's just not true. See drawnonward's solution. A mispredicted jump is more expensive than a boatload of bit operations, so I'll take the branch-free version any day (assuming we're talking cycle optimization).
roe
My intuition would tell me that drawnonward's solution is slower than mine (and Jukka Suomela's looks faster), but I agree that the best thing to do is to try all the answers to this question and profile them
Tomaka17
his has 8 trivial operations (no divisions or anything like that); yours has two operations and a branch that'll stall the pipeline in case of a misprediction which, depending on the data, will happen more or less often. Anyway, your function takes something like 2-20 cycles; his takes 8. So if more than one in every three calls causes a misprediction, his wins.
roe
This depends highly on the data: with a uniform distribution the branch will only fail 1 time in 13; but I must admit that I was seriously wrong with "I think you can't do much faster"
Tomaka17
Well, it'll go 'the other way' 1 in 13 times; there's no word as to which branch is the predicted one. It might be that it's actually only correct 1 in 13 times... :) Also, Jukka Suomela's solution below has only 6 operations, and no branches.
roe
Branch prediction is dynamic. After the first branch (which is 50/50), the processor stores the branch outcome and uses it as the prediction for the next time. Two false predictions in a row (a 0.6% chance with a uniform distribution) are required to change the prediction, but if that happens, performance will drop for 4 branches
Tomaka17
Modern CPUs have rather sophisticated branch predictors, and not necessarily only using the last result. Older CPUs, or cheaper CPUs, not so much. Some architectures have the initial prediction right there in the op-code (I know PPC does), and some always use that (early SPARCs for example) according to Wikipedia. Either way, as you say, it's dynamic, and if you want performance, dynamic is rarely what you want.
roe
+12  A: 

How about:

int foo(int x,int y)
{
    // As suggested by Luther Blissett in the comments below,
    // the array was increased to 16x16: this makes the index
    // computation a simple shift-and-add for the compiler.
    // Also use +8 rather than +6 in the hope that optimization
    // will be easier (you never know, there may be some fancy trick).
    static int sign[16][16] = {
                { 1, 1, 1, 1, 1, 1, 1, 1, 0, -1, -1, -1, -1, -1, -1, -1},
                { 1, 1, 1, 1, 1, 1, 1, 1, 0, -1, -1, -1, -1, -1, -1, -1},
                { 1, 1, 1, 1, 1, 1, 1, 1, 0, -1, -1, -1, -1, -1, -1, -1},
                { 1, 1, 1, 1, 1, 1, 1, 1, 0, -1, -1, -1, -1, -1, -1, -1},
                { 1, 1, 1, 1, 1, 1, 1, 1, 0, -1, -1, -1, -1, -1, -1, -1},
                { 1, 1, 1, 1, 1, 1, 1, 1, 0, -1, -1, -1, -1, -1, -1, -1},
                { 1, 1, 1, 1, 1, 1, 1, 1, 0, -1, -1, -1, -1, -1, -1, -1},
                { 1, 1, 1, 1, 1, 1, 1, 1, 0, -1, -1, -1, -1, -1, -1, -1},
                { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
                { -1, -1, -1, -1, -1, -1, -1, -1, 0, 1, 1, 1, 1, 1, 1, 1},
                { -1, -1, -1, -1, -1, -1, -1, -1, 0, 1, 1, 1, 1, 1, 1, 1},
                { -1, -1, -1, -1, -1, -1, -1, -1, 0, 1, 1, 1, 1, 1, 1, 1},
                { -1, -1, -1, -1, -1, -1, -1, -1, 0, 1, 1, 1, 1, 1, 1, 1},
                { -1, -1, -1, -1, -1, -1, -1, -1, 0, 1, 1, 1, 1, 1, 1, 1},
                { -1, -1, -1, -1, -1, -1, -1, -1, 0, 1, 1, 1, 1, 1, 1, 1},
                { -1, -1, -1, -1, -1, -1, -1, -1, 0, 1, 1, 1, 1, 1, 1, 1}
            };

    return sign[x+8][y+8];
}

This should be fast as there is no branching that will stall the processor.

Using g++ -O3 -S:

__Z3fooii:
  pushl   %ebp
  movl    %esp, %ebp
  movl    8(%ebp), %eax
  movl    12(%ebp), %edx
  popl    %ebp
  sall    $4, %eax
  addl    %edx, %eax
  movl    _ZZ3fooiiE4sign+544(,%eax,4), %eax
  ret
Martin York
As usual brute force works well for small domains :)
Matthieu M.
There is a possibility that computing the value would be faster than using the table, due to the table's frequent reloading into the processor cache. It depends on how the calling code is organized
Alsk
@Alsk: That may be true, but it's definitely something that would need to be measured to be confirmed. You also have to remember that if there is any kind of conditional (two in the OP's question), then a processor pipeline stall can happen, and that's not fast either (though faster than a fetch from main memory into the cache).
Martin York
I've thought of this, but I'm worried that the load instruction for the table index would be even more costly than multiplication.
Justin
@Justin: You could implement both and time them (or better profile them). From my experience, 'intuitive' detection of bottlenecks is not too reliable.
sum1stolemyname
@Justin: multiplication by constants is usually compiled to shifts and adds and thus very fast. Also, you can redeclare the sign array as sign[13][16], which means that only <<4 is required to compute the index.
Luther Blissett
@Luther Blissett : Done
Martin York
+1  A: 

You could do something like this (only with proper variable names, and made much less ugly!). Note that this ONLY works with two's complement numbers, and only if your values are limited to -6 to 6 as in your question.

Profile it to make sure it's faster than the clear way of doing it, and ONLY write code like this once you have determined that you can't meet your requirements with a much more obvious approach. With branch prediction etc., branches aren't always slow, on x86 for example. I would never write unportable code like this unless I had no choice to meet performance requirements.

Basically, extract the sign bits and exclusive-or them to get the result you want.

int foo(int x, int y)
{
    int s;

    if (x == 0 || y == 0) return 0;

    x = x >> 4; // Bit 0 of x will be the sign bit of x
    y = y >> 4; // Bit 0 of y will be the sign bit of y

    s = (x ^ y) & 1; // sign is 0 if they have the same sign, 1 otherwise

    return  1 - 2 * s;  // Make it 1 for the same sign, -1 otherwise
}

This compiles on my compiler to a couple of quick tests for zero, and what looks like quite an efficient bit of bit manipulation after that:

    test    ecx, ecx
    je  SHORT $LN1@foo
    test    edx, edx
    je  SHORT $LN1@foo
; Line 12
    xor ecx, edx
    mov eax, 1
    sar ecx, 4
    and ecx, 1
    add ecx, ecx
    sub eax, ecx
; Line 13
    ret 0
$LN1@foo:
; Line 5
    xor eax, eax
; Line 13
    ret 0
John Burton
OP said `x` can't be 0, so you can drop that check.
IVlad
Well, his first paragraph and example seemed to contradict that. If x can't be 0, then yes
John Burton
+6  A: 

Edit:

((x*y)>>7) | -(-(x*y)>>7)

The line above returns 1 if the two have the same sign, -1 if the signs differ, and 0 if either number is zero.
The lines below return 1 if both are positive and -1 if both are negative (0 otherwise).

Assuming signed 32-bit values. With |x|,|y| < 7 you could shift by 3 instead of 31.

  ((x&y)>>31)  // -1 or 0
-((-x&-y)>>31) //  1 or 0

((x&y)>>31) | -((-x&-y)>>31)

Assuming the result of < is 1 or 0 (which C guarantees):

-((x&y)<0)     // -1 or 0
((-x&-y)<0)    //  1 or 0

-((x&y)<0) | ((-x&-y)<0)

Either way looks like 8 operations.

drawnonward
Excellent bit-fiddling, +1. The middle one is my favorite, although I get the feeling it should be possible to shave another operation off of it.. just can't put my finger on it.
roe
+5  A: 

Here is another version (with ugly, non-portable bit manipulation tricks):

int foo(int x, int y) {
    return ((x^y) >> 4) - ((x^(-y)) >> 4);
}

Some explanations:

  • ((x^y) >> 4) is -1 if exactly one of x and y is negative, otherwise it is 0.
  • ((x^(-y)) >> 4) is -1 if exactly one of x and -y is negative, otherwise it is 0.
  • If x > 0 and y > 0, the result will be 0 - (-1) = 1.
  • If x < 0 and y < 0, the result will be 0 - (-1) = 1.
  • If x > 0 and y = 0, the result will be 0 - 0 = 0.
  • If x < 0 and y = 0, the result will be (-1) - (-1) = 0.
  • If x > 0 and y < 0, the result will be (-1) - 0 = -1.
  • If x < 0 and y > 0, the result will be (-1) - 0 = -1.

Assumes two's complement arithmetic and assumes that >> shifts with sign-extension.

Jukka Suomela
That's better than mine
John Burton
+1, very elegant. Is >> 4 cheaper than >> 31 in any situation, or why >> 4? (except that we may use it, given the value ranges)
roe
How is it non-portable, by the way? Apart from requiring signed shifts and two's complement arithmetic?
roe
I put >> 4 in mine because it was all that was necessary for the question, and it didn't require any assumptions about how big an int is on the platform.
John Burton
@roe: bit-shifting by any amount takes the same number of clock cycles on x86; I imagine it's the same on most other popular processors as well
BlueRaja - Danny Pflughoeft
@Martin - For every C or C++ compiler I've ever used, right shift zero-fills for an *unsigned* type, but fills with the sign bit for a *signed* type. I'm not 100% sure that's mandated by the standards, but I'd be surprised if it's not. That is, C shifts are arithmetic shifts, but an arithmetic shift on an unsigned type (where there is no sign bit to copy) is equivalent to a logical shift. Java invented >>> because Java doesn't have unsigned integer types. A bad solution IMO - I can never remember which one is meant to be the logical rather than the arithmetic shift.
Steve314
@BlueRaja - That's certainly true for current desktop chips, but barrel shifters requiring one or more clocks per bit are cheaper in silicon. They were used a lot back in ye olden days, and I wouldn't be surprised to find them on *really* minimal embedded processors etc even now.
Steve314
@Steve, @Martin: The C++ standard leaves the behavior implementation-defined if the value is signed and negative, but requires zero-fill for signed non-negative values. So in C++ at least, sign extension is only a possibility, not a guarantee.
Dennis Zickefoose
A: 

To express the sign of a number x as a "normalized" integer (i.e. -1, 0, +1), use

inline int sign(int x) { return (x > 0) - (x < 0); }

Deriving from the above, to compare x and y for sign equality use

inline bool same_sign(int x, int y) { 
  return sign(x) == sign(y);
}

for a boolean result.

Or, for a -1, 0, +1 result:

inline int compare_sign(int x, int y) { 
  return sign(x) * sign(y);
}

How efficient your final code will be depends, of course, on the quality of the compiler you are using.

AndreyT