ansaurus

Question

Fast dot product for a very special case

Answer 1

+6 A:

This really would require profiling but an alternative you might want to consider:

int result=0;
int mask=1;
for ( int i = 0; i < L; i++ ){
    if ( X & mask ){
        result+=Y[i];
    }
    mask <<= 1;
}

Typically bit shifting and bitwise operations are faster than multiplication, however, the if statement might be slower than a multiplication, although with branch prediction and large L my guess is it might be faster. You would really have to profile it, though, to determine if it resulted in any speedup.

As has been pointed out in the comments below, unrolling the loop either manually or via a compiler flag (such as "-funroll-loops" on GCC) could also speed this up (eliding the loop condition).

Edit
In the comments below, the following good tweak has been proposed:

int result=0;
for ( int i = 0; i < L; i++ ){
    if ( X & 1 ){
        result+=Y[i];
    }
    X >>= 1;
}

Michael Aaron Safyan 2010-05-13 10:07:15

Thanks for interesting solution, but X must not be an array - it should an integer.

psihodelia 2010-05-13 10:14:34

@psihodelia, didn't understand the question. I've updated since.

Michael Aaron Safyan 2010-05-13 10:20:30

alternativly one can replance mask[bit] with -bit

sellibitze 2010-05-13 10:22:06

psihodelia 2010-05-13 10:28:44

@psihodelia, yes, it can be slightly optimized so that it only shifts by one. I've just made that change.

Michael Aaron Safyan 2010-05-13 10:45:32

You could think about loop unrolling, if your L is 32 or so it may make sense to think about completely eliminating conditional instructions (in loop and if) to maximally exploit parallelism. At least on Intel/AMD CPUs the most expensive part of this code seems to be branching.

Krystian 2010-05-13 10:54:32

@Krystian, good point, but I'll let "-funroll-loops" take care of that.

Michael Aaron Safyan 2010-05-13 10:57:13

I think performance will be better if you use the bitwise tricks as below to eliminate the conditional. Then when loop unrolling is applied you will probably get a completely branchless solution which I expect is optimal.

mikera 2010-05-13 15:16:40

Don't use a variable mask. use (X and 1) and then shift X right by one bit each time. The conditional is OK on a modern intel or AMD because a good compiler (Intel or GCC) will make use of a conditional (non-branching) instruction which won't stall the pipeline.

phkahler 2010-05-14 01:05:22

@phkahler, good suggestion.

Michael Aaron Safyan 2010-05-14 01:21:32

I suppose, that on some architectures where MUL operation takes one CPU tick, result+=Y[i]*X will be faster, because no branch is required.

psihodelia 2010-05-14 09:20:22

Eliminate the conditional! `result += Y[i] `

R.. 2010-09-20 02:59:40

Answer 2

+3 A:

Is a suggestion to look into SSE2 helpful? It has dot-product type operations already, plus you can trivially do 4 (or perhaps 8, I forget the register size) simple iterations of your naive loop in parallel. SSE also has some simple logic-type operations so it may be able to do additions rather than multiplications without using any conditional operations... again you'd have to look at what ops are available.

John 2010-05-13 10:13:10

No, thanks. SSE is a machine-specific implementation. I am looking for a general algorithm.

psihodelia 2010-05-13 10:17:01

@psihodelia: You should definitely consider SSE; this problem would benefit enormously from stream processing.

Konrad Rudolph 2010-05-13 13:10:35

Ants Aasma 2010-05-14 01:27:11

That's pretty cool. Although Even I'd argue SSE4 is not mainstream enough to target right now.

John 2010-05-14 08:03:21

Answer 3

A:

You can store your bit vector as a sequence of ints where each int packs a couple of coefficients as bits. Then, the component-wise multiplication is equivalent to bit-and. With this you simply need to count the number of set bits which could be done like this:

inline int count(uint32_t x) {
    // see link
}

int dot(uint32_t a, uint32_t b) {
    return count(a & b);
}

For a bit hack to count the set bits see http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel

Edit: Sorry I just realized only one of the vectors contains elements of {0,1} and the other one doesn't. This answer only applies to the case where both vectors are limited to coefficients from the set of {0,1}.

sellibitze 2010-05-13 10:13:49

Answer 4

A:

Well you want all bits to get past if its a 1 and none if its a 0. So you want to somehow turn 1 into -1 (ie 0xffffffff) and 0 stays the same. Thats just -X .... so you do ...

Y & (-X)

for each element ... job done?

Edit2: To give a code example you can do something like this and avoid the branch:

int result=0;
for ( int i = 0; i < L; i++ )
{
   result+=Y[i] & -(int)((X >> i) & 1);
}

Of course you'd be best off keeping the 1s and 0s in an array of ints and therefore avoiding the shifts.

Edit: Its also worth noting that if the values in Y are 16-bits in size then you can do 2 of these and operations per operation (4 if you have 64-bit registers). It does mean negating the X values 1 by 1 into a larger integer, though.

ie YVals = -4, 3 in 16-bit = 0xFFFC, 0x3 ... put into 1 32-bit and you get 0xFFFC0003. If you have 1, 0 as the X vals then you form a bit mask of 0xFFFF0000 and the 2 together and you've got 2 results in 1 bitwise-and op.

Another edit:

IF you want the code on how to do the 2nd method something like this should work (Though it takes advantage of unspecified behaviour so it may not work on every compiler .. works on every compiler I've come across though).

union int1632
{
     int32_t i32;
     int16_t i16[2];
};

int result=0;
for ( int i = 0; i < (L & ~0x1); i += 2 )
{
    int3264 y3264;
    y3264.i16[0] = Y[i + 0];
    y3264.i16[1] = Y[i + 1];

    int3264 x3264;
    x3264.i16[0] = -(int16_t)((X >> (i + 0)) & 1);
    x3264.i16[1] = -(int16_t)((X >> (i + 1)) & 1);

    int3264 res3264;
    res3264.i32  = y3264.i32 & x3264.i32;

    result += res3264.i16[0] + res3264.i16[1];    
}

if ( i < L )
    result+=Y[i] & -(int)((X >> i) & 1);

Hopefully the compiler will optimise out the assigns (Off the top of my head i'm not sure but the idea could be re-worked so that they definitely are) and give you a small speed up in that you now only need to do 1 bitwise-and instead of 2. The speed up would be minor though ...

Goz 2010-05-13 10:28:23

I have checked your solution , but it gives me totally wrong answers. One has to find a dot product of a binary word with an array of integers.

psihodelia 2010-05-13 10:33:38

-1 Doesn't relate to the question at all really.

Elemental 2010-05-13 10:34:34

Well you DO have to add up the results earlier. You also need to promote the values in the Y vector to integers for it to work.

Goz 2010-05-13 10:35:53

Does a code example help? You'd still have to profile both ways because it IS plausible the branch is cheaper ... My method is, however, constant time.

Goz 2010-05-13 10:42:33

@Goz - looked at the worked example in the text; your algorithm cannot be made to do this.

Elemental 2010-05-13 10:42:54

@Goz - yea, thanks! Your algorithm indeed is very good. I have checked it, it gives good results.

psihodelia 2010-05-13 10:53:54

@Elemental: Do you mean the idea of doing 2, or more, at once? IT requires modifications to me algorithm above but it IS doable ...

Goz 2010-05-13 10:55:49

Updated slightly to reduce the X calculation by 1 op.

Goz 2010-05-13 11:10:05

Updated again to show a possible way to reduce the number of bitwise-ands by half for a small speed up.

Goz 2010-05-13 11:18:39

I wonder if you could optimize the code by shifting in the loop so as to only need 1 bit-shift by iteration (instead of `i`), I've posted a separate answer though because I don't know if this would not mess up with the loop unrolling...

Matthieu M. 2010-05-13 11:46:26

Answer 5

+2 A:

This solution is identical to, but slightly faster (by my test), than Micheal Aaron's:

long Lev=1;
long Result=0
for (int i=0;i<L;i++) {
  if (X & Lev)
     Result+=Y[i];
  Lev*=2;
}

I thought there was a numerical way to rapidly establish the next set bit in a word which should improve performance if your X data is very sparse but currently cannot find said numerical formulation currently.

Elemental 2010-05-13 10:40:22

MA has modded his solution to include this optimisation. He has my upvote

Elemental 2010-05-13 10:53:13

Answer 6

A:

Represente X using linked list of the places where x[i] = 1. To find required sum you need O(N) operations where N is size of your list.

Alexey Malistov 2010-05-13 10:44:43

This is a good idea if X is sparse AND X is relatively static. Obviously the fastest implementation would be some kind of contiguous array as opposed to the linked list suggested here.

Elemental 2010-05-13 10:51:22

@Elemental you mean sparse, relatively static, and large.

phkahler 2010-05-14 01:12:41

Yeah and the questioner implies that the size of X is less than 32 which definitely isn't large.

Elemental 2010-05-14 07:33:58

Answer 7

+2 A:

There is probably no general answer to this question. You need to profile your code under all the different targets. Performance will depend on compiler optimizations such as loop unwinding and SIMD instructions that are available on most modern CPUs (x86, PPC, ARM all have their own implementations).

Krumelur 2010-05-13 10:46:21

Answer 8

+4 A:

Try this:

int result=0;
for ( int i = 0; i < L; i++ ){
    result+=Y[i] & (~(((X>>i)&1)-1));
}

This avoids a conditional statement and uses bitwise operators to mask the scalar value with either zeros or ones.

mikera 2010-05-13 10:55:47

Alternative for the bitwise operator that also works is:result+=Y[i] Not sure which is faster....

mikera 2010-05-13 10:58:52

Logically your second method contains one less op so "ought" to be faster

Goz 2010-05-13 11:10:35

Yes I'd agree... though I've learnt over the years not to assume anything when it comes down to compiler optimisations and microprocessor quirks!

mikera 2010-05-13 11:17:38

Practically they run the same speed, as I've tested.

KennyTM 2010-05-13 11:18:59

Cool, thanks KennyTM (I didn't have a C compiler handy!). I've been working on the standard assumption that bitwise operations are basically free on modern processors for quite a while :-)

mikera 2010-05-13 12:14:36

Answer 9

+2 A:

I've seen a number of responses with bit trickery (to avoid branching) but none got the loop right imho :/

Optimizing @Goz answer:

int result=0;
for (int i = 0, x = X; x > 0; ++i, x>>= 1 )
{
   result += Y[i] & -(int)(x & 1);
}

Advantages:

no need to do i bit-shifting operations each time (X>>i)
the loop stops sooner if X contains 0 in higher bits

Now, I do wonder if it runs faster, especially since the premature stop of the for loop might not be as easy for loop unrolling (compared to a compile-time constant).

Matthieu M. 2010-05-13 11:44:26

Nice. How about also incrementing i to skip the zeros in X at the beginning of the loop?It's extra code but it avoids memory accesses so it ought to be a net win.....

mikera 2010-05-13 11:54:53

Been thinking about this: since we've gone to the effort to eliminate branches with bitwise operations it probably doesn't make sense to add them back in with variable length loops, at least for the case where L is a small fixed number. If the ones and zeros are random 50% then you only save on average one memory access on each side of the vector.

mikera 2010-05-13 12:08:52

+1 Nice on the early out. It doesn't, however, provide a win if there is a 1 in the top bit however as you still do just as many right shifts. It may, however, be easier to pipeline for the compiler.

Goz 2010-05-13 12:39:41

Konrad Rudolph 2010-05-13 12:56:46

@Rudolph: I don't know, I just copied this bit from `Goz`. @mikera/Goz: I am not sure about the termination condition either as I am afraid it could screw up with the loop unrolling, however bit-shifting by 1 in each loop does seem faster :)

Matthieu M. 2010-05-13 13:47:09

Answer 10

+1 A:

result = 0;
for(int i = 0; i < L ; i++)
    if(X[i]!=0)
      result += Y[i];

TheMachineCharmer 2010-05-13 12:48:34

How is this superior to `if (X[i] == 1)`? And why isn’t it a good idea?

Konrad Rudolph 2010-05-13 12:50:34

It is in fact inferior to `if(X[i] == 1)`.It is not a good idea because having anything except `bool` in conditional statement is considered bad. :)

TheMachineCharmer 2010-05-13 12:58:14

@TheMachineCharmer: So then why not write `X[i] == 1` in your answer? From an efficiency point of view, it’s the same (!) and that would be a good answer, even though it uses a conditional which has been avoided completely by other answers.

Konrad Rudolph 2010-05-13 13:03:18

@Konrad Rudolph: Thanks!! :)

TheMachineCharmer 2010-05-13 15:00:42

Typically comparisons with zero are faster than comparisons with other values as most machines have a jump-if-zero, jump-if-not-zero, and other jump statements based on zero. Comparing with other values typically implies a subtraction followed by a comparison with zero.

Michael Aaron Safyan 2010-05-14 01:25:49

@Michael Aaron Safyan: That is great idea and answers the first question of Konrad Rudalf and clarifies my doubts. Thanks! :)

TheMachineCharmer 2010-05-14 07:32:33

Answer 11

+3 A:

Since size explicitly doesn’t matter, I think the following is probably the most efficient general-purpose code:

int result = 0;
for (size_t i = 0; i < 32; ++i)
    result += Y[i] & -X[i];

Bit-encoding X just doesn’t bring anything to the table (even if the loop may potentially terminate earlier as @Mathieu correctly noted). But omitting the if inside the loop does.

Of course, loop unrolling can speed this up drastically, as others have noted.

Konrad Rudolph 2010-05-13 14:04:29

I think, on some architectures, Y[i]*X[i] takes less instructions than AND and negation.

psihodelia 2010-05-14 09:23:50

@psihodelia: Less instructions does not necessarily mean less runtime (especially on a CISC architecture).

Drew Hall 2010-05-14 10:43:51

@psihodelia: That’s irrelevant. You can be pretty sure that negation + bitwise and will *always* be faster than multiplication.

Konrad Rudolph 2010-05-14 12:54:27

For a general algorithm it does matter how many operations are involved. On RISC architecture one operation should be faster than two.

psihodelia 2010-05-15 08:02:21

Answer 12

+1 A:

Joseph Quinsey 2010-05-13 14:12:10

I like this! My only question would be whether the instruction cache cost outweighs the obvious advantage in terms of number of operations?

mikera 2010-05-13 15:20:03

@mikera: **Time it.** It's the only way to tell. Alas, it's quite possible that the massively unrolled version won't be faster due to it being harder for the processor to hold all that code in its ICache (and do other useful work too).

Donal Fellows 2010-05-13 15:44:26

The larger problem is that this has 8 indirect jumps. Thats more than 100 cycles of penalty right there, not including the time to actually add stuff.

Ants Aasma 2010-05-14 00:57:37

Answer 13

+2 A:

How about combining a shifting loop with a small lookup table?

    int result=0;

    for ( int x=X; x!=0; x>>=4 ){
        switch (x&15) {
            case 0: break;
            case 1: result+=Y[0]; break;
            case 2: result+=Y[1]; break;
            case 3: result+=Y[0]+Y[1]; break;
            case 4: result+=Y[2]; break;
            case 5: result+=Y[0]+Y[2]; break;
            case 6: result+=Y[1]+Y[2]; break;
            case 7: result+=Y[0]+Y[1]+Y[2]; break;
            case 8: result+=Y[3]; break;
            case 9: result+=Y[0]+Y[3]; break;
            case 10: result+=Y[1]+Y[3]; break;
            case 11: result+=Y[0]+Y[1]+Y[3]; break;
            case 12: result+=Y[2]+Y[3]; break;
            case 13: result+=Y[0]+Y[2]+Y[3]; break;
            case 14: result+=Y[1]+Y[2]+Y[3]; break;
            case 15: result+=Y[0]+Y[1]+Y[2]+Y[3]; break;
        }
        Y+=4;
    }

The performance of this will depend on how good the compiler is at optimising the switch statement, but in my experience they are pretty good at that nowadays....

mikera 2010-05-13 15:36:28

I like your solution better than mine.

Joseph Quinsey 2010-05-13 15:47:35

You need to advance the Y pointer as well.

Alan 2010-05-13 19:46:54

Good spot - fixed!

mikera 2010-05-14 10:34:16

Answer 14

+1 A:

It's quite likely that the time spent to load X and Y from main memory will dominate. If this is the case for your CPU architecture, the algorithm is faster when loading less. This means that storing X as a bitmask and expanding it into L1 cache will speed up the algorithm as a whole.

Another relevant question is whether your compiler will generate optimal loads for Y. This is higly CPU and compiler dependent. But in general, it helps if the compiler can see precsiely which values are needed when. You could manually unroll the loop. However, if L is a contant, leave it to the compiler:

template<int I> inline void calcZ(int (&X)[L], int(&Y)[L], int &Z) {
  Z += X[I] * Y[I]; // Essentially free, as it operates in parallel with loads.
  calcZ<I-1>(X,Y,Z);
}
template< > inline void calcZ<0>(int (&X)[L], int(&Y)[L], int &Z) {
  Z += X[0] * Y[0];
}
inline int calcZ(int (&X)[L], int(&Y)[L]) {
    int Z = 0;
    calcZ<L-1>(X,Y,Z);
    return Z;
}

(Konrad Rudolph questioned this in a comment, wondering about memory use. That's not the real bottleneck in modern computer architectures, bandwidth between memory and CPU is. This answer is almost irrelevant if Y is somehow already in cache. )

MSalters 2010-05-14 08:41:29

N.B. I don’t question that less memory transfer potentially increases performance; I questioned whether this was “obvious”. Of course, caching is a very important factor.

Konrad Rudolph 2010-05-14 12:58:52

ansaurus

tags:

views:

answers:

Fast dot product for a very special case

related questions