views:

183

answers:

2

I have a 64 bit number (but only the 42 low order bits are used) and need to computer the sum of the 4 bits at n, n+m, n+m*2 and n+m*3 (note: anything that can produce a sum >4 is invalid) for some fixed m and every value of n that places all the bits in the number

as an example using m=3 and given the 16-bit number

0010 1011 0110 0001

I need to compute

2, 3, 1, 2, 3, 0, 3

Does anyone have any (cool) ideas for ways to do this? I'm fine with bit twiddling.


My current thought is to make bit shifted copies of the input to align the values to be summed and then build a logic tree to do a 4x 1bit adder.

v1 = In;
v2 = In<<3;
v3 = In<<6;
v4 = In<<9;

a1 = v1 ^ v2;
a2 = v1 & v2;
b1 = v3 ^ v4;
b2 = v3 & v4;
c2 = a1 & b1;
d2 = a2 ^ b2;

o1 = a1 ^ b1;
o2 = c2 ^ d2;
o4 = a2 & b2;

This does end up with the bits of the result spread across 3 different ints but oh well.

edit: as it happens I need the histogram of the sums so doing a bit-count of o4, o2&o1, o2 and o1 gives me what I want.


a second solution uses a perfect hash function

arr = [0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4];

for(int i = 0; i < N; i++)
{
   out[i] = arr[(In & 0b1001001001) % 30]; 
   In >>= 1;
}

This works by noting that the 4 selected bits can only take on 16 patterns and that (by guess and check) they can be hashed into 0-15 using mod 30. From there, a table of computed values gives the needed sum. As it happens only 3 of the 4 strides I need work this way.


p.s.

Correct trumps fast. Fast trumps clear. I expect to be running this millions of time.

+2  A: 
dionadar
nice. clean. OTOH I need really blazing fast as this is part of the inner most loop of a tree search algorithm.
BCS
This doesn't work. He only wants four bits in each sum. Also, the % operator is best avoided, as it is usually terribly slow.
UncleO
I do not see anything that indicates that he would only want 4 bits per sum? "[...] computer the sum of the bits at [...]"yeah % may be rather slow, and "c[i] ...; if(++i == m) i = m;" will probably be faster (although you need to 0 i out between the loops then)
dionadar
% is now removed - the whole thing is now pointer instead of array based
dionadar
A little oops... after your rework of the question I understood what you wanted the first time >.>
dionadar
In looking at that it still has the problem that c[0] is the sum of 6 bits and I want exactly 4.
BCS
I probably have an obnoxious amount of fun with this^^
dionadar
I have the same vice. :b
BCS
could you post all 3 solutions? Just the inner code would do
BCS
I posted the whole "application" over there: http://pastebin.com/f5ce069b2
dionadar
The full code of the updated test application can be found over there: http://pastebin.com/f573c0c2
dionadar
If you are willing to spend ~0.5kb of memory, you can speed your perfect hash solution up to be the fastest of all (~1500 ms): you just need to precompile all 586 possible solutions and leave out your modulo 30.
dionadar
how do you get 586 solutions?
BCS
take your perfect hash solution, remove the "% 30" and make arr large enough to allow for arr[0x1001001001]
dionadar
+1  A: 

A suggestion that I don't want to code right now is to use a loop, an array to hold partial results, and constants to pick up the bits m at a time.

loop 
   s[3*i] += x & (1 << 0);
   s[3*i+1] += x & (1 << 1);
   s[3*i+2] += x & (1 << 2);
   x >> 3;

This will pick too many bits in each sum. But you can also keep track of the intermediate results and subtract from the sums as you go, to account for the bit that may not be there anymore.

loop 
   s[3*i] += p[3*i]   = x & (1 << 0);
   s[3*i+1] += p[3*i+1] = x & (1 << 1);
   s[3*i+2] += p[3*i+2] = x & (1 << 2);

   s[3*i] -= p[3*i-10];
   s[3*i+1] -= p[3*i-9];
   s[3*i+2] -= p[3*i-8];
   x >> 3;

with the appropriate bounds checking, of course.

The fastest approach is to just hardcode the sums themselves.

s[0] = (x & (1<<0)) + (x & (1<<3)) + (x & (1<<6)) + (x & (1<<9));

etc. (The shifts occur at compile time.)

UncleO
similar to the inital idea, this is not a very generic answer, but instead if implemented needs to use two loops. One is the one you wrote, the other one goes from 0 to m to do the adding and subtracting
dionadar
It doesn't have to be generic. m is fixed at 3.What do you mean by the second loop?
UncleO
ok, dump that, i just read the comment on the question
dionadar
The hard code is fast for each by it's self. What about all together?
BCS