Given a vector X of size L, where every scalar element of X is from the binary set {0,1}, I need to find the dot product z=dot(X,Y), where the vector Y of size L consists of integer-valued elements. I suspect there must be a very fast way to do this.

Let's say we have L=4; X[L]={1, 0, 0, 1}; Y[L]={-4, 2, 1, 0} and we have to find z=X[0]*Y[0] + X[1]*Y[1] + X[2]*Y[2] + X[3]*Y[3] (which in this case will give us -4).

It is obvious that X can be represented using binary digits, e.g. as a single integer of type int32 for L=32. Then all we have to do is find the dot product of this integer with an array of 32 integers. Do you have any ideas or suggestions for how to do this very fast?

+6  A: 

This really would require profiling but an alternative you might want to consider:

int result=0;
int mask=1;
for ( int i = 0; i < L; i++ ){
    if ( X & mask ){
        result+=Y[i];
    }
    mask <<= 1;
}

Typically bit shifting and bitwise operations are faster than multiplication; however, the if statement might be slower than a multiplication, although with branch prediction and a large L my guess is it might be faster. You would really have to profile it, though, to determine whether it results in any speedup.

As has been pointed out in the comments below, unrolling the loop either manually or via a compiler flag (such as "-funroll-loops" on GCC) could also speed this up (eliding the loop condition).

Edit
In the comments below, the following good tweak has been proposed:

int result=0;
for ( int i = 0; i < L; i++ ){
    if ( X & 1 ){
        result+=Y[i];
    }
    X >>= 1;
}
Michael Aaron Safyan
Thanks for the interesting solution, but X must not be an array - it should be an integer.
psihodelia
@psihodelia, didn't understand the question. I've updated since.
Michael Aaron Safyan
alternatively one can replace mask[bit] with -bit
sellibitze
@psihodelia, yes, it can be slightly optimized so that it only shifts by one. I've just made that change.
Michael Aaron Safyan
You could think about loop unrolling, if your L is 32 or so it may make sense to think about completely eliminating conditional instructions (in loop and if) to maximally exploit parallelism. At least on Intel/AMD CPUs the most expensive part of this code seems to be branching.
Krystian
@Krystian, good point, but I'll let "-funroll-loops" take care of that.
Michael Aaron Safyan
I think performance will be better if you use the bitwise tricks as below to eliminate the conditional. Then when loop unrolling is applied you will probably get a completely branchless solution which I expect is optimal.
mikera
Don't use a variable mask. Use (X & 1) and then shift X right by one bit each time. The conditional is OK on a modern Intel or AMD because a good compiler (Intel or GCC) will make use of a conditional (non-branching) instruction which won't stall the pipeline.
phkahler
@phkahler, good suggestion.
Michael Aaron Safyan
I suppose that on some architectures where the MUL operation takes one CPU tick, result+=Y[i]*X will be faster, because no branch is required.
psihodelia
Eliminate the conditional! `result += Y[i] `
R..
+3  A: 

Is a suggestion to look into SSE2 helpful? It has dot-product type operations already, plus you can trivially do 4 (or perhaps 8, I forget the register size) simple iterations of your naive loop in parallel. SSE also has some simple logic-type operations so it may be able to do additions rather than multiplications without using any conditional operations... again you'd have to look at what ops are available.

John
No, thanks. SSE is a machine-specific implementation. I am looking for a general algorithm.
psihodelia
@psihodelia: You should definitely consider SSE; this problem would benefit enormously from stream processing.
Konrad Rudolph
Ants Aasma
That's pretty cool. Although even I'd argue SSE4 is not mainstream enough to target right now.
John
A: 

You can store your bit vector as a sequence of ints where each int packs a couple of coefficients as bits. Then, the component-wise multiplication is equivalent to bit-and. With this you simply need to count the number of set bits which could be done like this:

inline int count(uint32_t x) {
    // parallel bit count from the link below
    x = x - ((x >> 1) & 0x55555555);
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    return (((x + (x >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}

int dot(uint32_t a, uint32_t b) {
    return count(a & b);
}

For a bit hack to count the set bits see http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel

Edit: Sorry, I just realized only one of the vectors contains elements of {0,1} and the other one doesn't. This answer only applies to the case where both vectors are limited to coefficients from the set {0,1}.

sellibitze
A: 

Well, you want all the bits to pass if it's a 1 and none if it's a 0. So you want to somehow turn 1 into -1 (i.e. 0xffffffff) while 0 stays the same. That's just -X ... so you do ...

Y & (-X)

for each element ... job done?

Edit2: To give a code example you can do something like this and avoid the branch:

int result=0;
for ( int i = 0; i < L; i++ )
{
   result+=Y[i] & -(int)((X >> i) & 1);
}

Of course you'd be best off keeping the 1s and 0s in an array of ints and therefore avoiding the shifts.

Edit: It's also worth noting that if the values in Y are 16 bits in size then you can do 2 of these AND operations per machine operation (4 if you have 64-bit registers). It does mean negating the X values one by one into a larger integer, though.

i.e. YVals = -4, 3 in 16-bit = 0xFFFC, 0x3 ... put into one 32-bit value and you get 0xFFFC0003. If you have 1, 0 as the X vals then you form a bit mask of 0xFFFF0000, AND the 2 together and you've got 2 results in 1 bitwise-AND op.

Another edit:

If you want the code for the 2nd method, something like this should work (though it takes advantage of unspecified behaviour so it may not work with every compiler ... it works with every compiler I've come across, though).

union int1632
{
     int32_t i32;
     int16_t i16[2];
};

int result=0;
int i;
for ( i = 0; i < (L & ~0x1); i += 2 )
{
    int1632 y1632;
    y1632.i16[0] = Y[i + 0];
    y1632.i16[1] = Y[i + 1];

    int1632 x1632;
    x1632.i16[0] = -(int16_t)((X >> (i + 0)) & 1);
    x1632.i16[1] = -(int16_t)((X >> (i + 1)) & 1);

    int1632 res1632;
    res1632.i32  = y1632.i32 & x1632.i32;

    result += res1632.i16[0] + res1632.i16[1];
}

if ( i < L )
    result += Y[i] & -(int)((X >> i) & 1);

Hopefully the compiler will optimise out the assigns (off the top of my head I'm not sure, but the idea could be re-worked so that they definitely are) and give you a small speed up in that you now only need to do 1 bitwise-AND instead of 2. The speed up would be minor, though ...

Goz
I have checked your solution, but it gives me totally wrong answers. One has to find the dot product of a binary word with an array of integers.
psihodelia
-1 Doesn't relate to the question at all really.
Elemental
Well you DO have to add up the results earlier. You also need to promote the values in the Y vector to integers for it to work.
Goz
Does a code example help? You'd still have to profile both ways because it IS plausible the branch is cheaper ... My method is, however, constant time.
Goz
@Goz - looked at the worked example in the text; your algorithm cannot be made to do this.
Elemental
@Goz - yea, thanks! Your algorithm indeed is very good. I have checked it, it gives good results.
psihodelia
@Elemental: Do you mean the idea of doing 2, or more, at once? It requires modifications to my algorithm above but it IS doable ...
Goz
Updated slightly to reduce the X calculation by 1 op.
Goz
Updated again to show a possible way to reduce the number of bitwise-ands by half for a small speed up.
Goz
I wonder if you could optimize the code by shifting in the loop so as to only need one bit-shift per iteration (instead of `i`); I've posted a separate answer though, because I don't know whether this would mess up the loop unrolling...
Matthieu M.
+2  A: 

This solution is identical to, but slightly faster (by my test) than, Michael Aaron's:

long Lev=1;
long Result=0;
for (int i=0;i<L;i++) {
  if (X & Lev)
     Result+=Y[i];
  Lev*=2;
}

I thought there was a numerical way to rapidly establish the next set bit in a word, which should improve performance if your X data is very sparse, but I currently cannot find said numerical formulation.

Elemental
MA has modded his solution to include this optimisation. He has my upvote
Elemental
A: 

Represent X using a linked list of the positions where x[i] = 1. To find the required sum you need O(N) operations, where N is the size of your list.

Alexey Malistov
This is a good idea if X is sparse AND X is relatively static. Obviously the fastest implementation would be some kind of contiguous array as opposed to the linked list suggested here.
Elemental
@Elemental you mean sparse, relatively static, and large.
phkahler
Yeah, and the questioner implies that the size of X is less than 32, which definitely isn't large.
Elemental
+2  A: 

There is probably no general answer to this question. You need to profile your code under all the different targets. Performance will depend on compiler optimizations such as loop unwinding and SIMD instructions that are available on most modern CPUs (x86, PPC, ARM all have their own implementations).

Krumelur
+4  A: 

Try this:

int result=0;
for ( int i = 0; i < L; i++ ){
    result+=Y[i] & (~(((X>>i)&1)-1));
}

This avoids a conditional statement and uses bitwise operators to mask the scalar value with either zeros or ones.

mikera
An alternative to the bitwise operator that also works is: result+=Y[i] ... Not sure which is faster....
mikera
Logically your second method contains one less op so "ought" to be faster
Goz
Yes I'd agree... though I've learnt over the years not to assume anything when it comes down to compiler optimisations and microprocessor quirks!
mikera
Practically they run the same speed, as I've tested.
KennyTM
Cool, thanks KennyTM (I didn't have a C compiler handy!). I've been working on the standard assumption that bitwise operations are basically free on modern processors for quite a while :-)
mikera
+2  A: 

I've seen a number of responses with bit trickery (to avoid branching) but none got the loop right imho :/

Optimizing @Goz's answer:

int result = 0;
unsigned x = X;
for (int i = 0; x != 0; ++i, x >>= 1)
{
   result += Y[i] & -(int)(x & 1);
}

Advantages:

  • no need to do i bit-shifting operations each time (X>>i)
  • the loop stops sooner if X contains 0 in higher bits

Now, I do wonder if it runs faster, especially since the premature stop of the for loop might not be as easy for loop unrolling (compared to a compile-time constant).

Matthieu M.
Nice. How about also incrementing i to skip the zeros in X at the beginning of the loop? It's extra code but it avoids memory accesses so it ought to be a net win.....
mikera
Been thinking about this: since we've gone to the effort to eliminate branches with bitwise operations it probably doesn't make sense to add them back in with variable length loops, at least for the case where L is a small fixed number. If the ones and zeros are random 50% then you only save on average one memory access on each side of the vector.
mikera
+1 Nice on the early out. It doesn't, however, provide a win if there is a 1 in the top bit, as you still do just as many right shifts. It may, however, be easier for the compiler to pipeline.
Goz
Konrad Rudolph
@Rudolph: I don't know, I just copied this bit from `Goz`. @mikera/Goz: I am not sure about the termination condition either, as I am afraid it could screw up the loop unrolling; however, bit-shifting by 1 in each iteration does seem faster :)
Matthieu M.
+1  A: 
result = 0;
for(int i = 0; i < L ; i++)
    if(X[i]!=0)
      result += Y[i];
TheMachineCharmer
How is this superior to `if (X[i] == 1)`? And why isn’t it a good idea?
Konrad Rudolph
It is in fact inferior to `if(X[i] == 1)`. It is not a good idea because having anything except `bool` in a conditional statement is considered bad. :)
TheMachineCharmer
@TheMachineCharmer: So then why not write `X[i] == 1` in your answer? From an efficiency point of view, it’s the same (!) and that would be a good answer, even though it uses a conditional which has been avoided completely by other answers.
Konrad Rudolph
@Konrad Rudolph: Thanks!! :)
TheMachineCharmer
Typically comparisons with zero are faster than comparisons with other values as most machines have a jump-if-zero, jump-if-not-zero, and other jump statements based on zero. Comparing with other values typically implies a subtraction followed by a comparison with zero.
Michael Aaron Safyan
@Michael Aaron Safyan: That is a great idea and answers the first question of Konrad Rudolph and clarifies my doubts. Thanks! :)
TheMachineCharmer
+3  A: 

Since size explicitly doesn’t matter, I think the following is probably the most efficient general-purpose code:

int result = 0;
for (size_t i = 0; i < 32; ++i)
    result += Y[i] & -X[i];

Bit-encoding X just doesn’t bring anything to the table (even if the loop may potentially terminate earlier, as @Matthieu correctly noted). But omitting the if inside the loop does.

Of course, loop unrolling can speed this up drastically, as others have noted.

Konrad Rudolph
I think on some architectures Y[i]*X[i] takes fewer instructions than AND plus negation.
psihodelia
@psihodelia: Fewer instructions does not necessarily mean less runtime (especially on a CISC architecture).
Drew Hall
@psihodelia: That’s irrelevant. You can be pretty sure that negation + bitwise and will *always* be faster than multiplication.
Konrad Rudolph
For a general algorithm it does matter how many operations are involved. On a RISC architecture one operation should be faster than two.
psihodelia
+1  A: 
Joseph Quinsey
I like this! My only question would be whether the instruction cache cost outweighs the obvious advantage in terms of number of operations?
mikera
@mikera: **Time it.** It's the only way to tell. Alas, it's quite possible that the massively unrolled version won't be faster due to it being harder for the processor to hold all that code in its ICache (and do other useful work too).
Donal Fellows
The larger problem is that this has 8 indirect jumps. That's more than 100 cycles of penalty right there, not including the time to actually add stuff.
Ants Aasma
+2  A: 

How about combining a shifting loop with a small lookup table?

    int result=0;

    for ( int x=X; x!=0; x>>=4 ){
        switch (x&15) {
            case 0: break;
            case 1: result+=Y[0]; break;
            case 2: result+=Y[1]; break;
            case 3: result+=Y[0]+Y[1]; break;
            case 4: result+=Y[2]; break;
            case 5: result+=Y[0]+Y[2]; break;
            case 6: result+=Y[1]+Y[2]; break;
            case 7: result+=Y[0]+Y[1]+Y[2]; break;
            case 8: result+=Y[3]; break;
            case 9: result+=Y[0]+Y[3]; break;
            case 10: result+=Y[1]+Y[3]; break;
            case 11: result+=Y[0]+Y[1]+Y[3]; break;
            case 12: result+=Y[2]+Y[3]; break;
            case 13: result+=Y[0]+Y[2]+Y[3]; break;
            case 14: result+=Y[1]+Y[2]+Y[3]; break;
            case 15: result+=Y[0]+Y[1]+Y[2]+Y[3]; break;
        }
        Y+=4;
    }

The performance of this will depend on how good the compiler is at optimising the switch statement, but in my experience they are pretty good at that nowadays....

mikera
I like your solution better than mine.
Joseph Quinsey
You need to advance the Y pointer as well.
Alan
Good spot - fixed!
mikera
+1  A: 

It's quite likely that the time spent loading X and Y from main memory will dominate. If this is the case for your CPU architecture, the algorithm is faster when it loads less data. This means that storing X as a bitmask and expanding it into L1 cache will speed up the algorithm as a whole.

Another relevant question is whether your compiler will generate optimal loads for Y. This is highly CPU- and compiler-dependent. But in general, it helps if the compiler can see precisely which values are needed when. You could manually unroll the loop, but if L is a constant, leave it to the compiler:

template<int I> inline void calcZ(int (&X)[L], int(&Y)[L], int &Z) {
  Z += X[I] * Y[I]; // Essentially free, as it operates in parallel with loads.
  calcZ<I-1>(X,Y,Z);
}
template< > inline void calcZ<0>(int (&X)[L], int(&Y)[L], int &Z) {
  Z += X[0] * Y[0];
}
inline int calcZ(int (&X)[L], int(&Y)[L]) {
    int Z = 0;
    calcZ<L-1>(X,Y,Z);
    return Z;
}

(Konrad Rudolph questioned this in a comment, wondering about memory use. That's not the real bottleneck in modern computer architectures; bandwidth between memory and CPU is. This answer is almost irrelevant if Y is somehow already in cache.)

MSalters
N.B. I don’t question that less memory transfer potentially increases performance; I questioned whether this was “obvious”. Of course, caching is a very important factor.
Konrad Rudolph