I have a 16-bit value with its bits "interlaced".

I want to get an array of 8 items (values 0 to 3) that stores the bits in this order:

  • item 0: bits 7 and 15
  • item 1: bits 6 and 14
  • item 2: bits 5 and 13
  • ...
  • item 7: bits 0 and 8

This is a trivial solution:

function uninterlace(n) {
  return [((n>>7)&1)|((n>>14)&2), // bits 7 and 15
          ((n>>6)&1)|((n>>13)&2), // bits 6 and 14
          ((n>>5)&1)|((n>>12)&2), // bits 5 and 13
          ((n>>4)&1)|((n>>11)&2), // bits 4 and 12
          ((n>>3)&1)|((n>>10)&2), // bits 3 and 11
          ((n>>2)&1)|((n>> 9)&2), // bits 2 and 10
          ((n>>1)&1)|((n>> 8)&2), // bits 1 and 9
          ((n>>0)&1)|((n>> 7)&2)];// bits 0 and 8
}
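For example, with an input of 0x1234 (00010010 00110100):

// item 3 picks up bit 4 (= 1) and bit 12 (= 1), giving 1 | (1 << 1) = 3
uninterlace(0x1234); // [0, 0, 1, 3, 0, 1, 2, 0]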

Does anyone know a better (faster) way of doing this?

Edit:

Notes:

  • Building a precalculated table is not an option.
  • Assembler or CPU-specific optimizations cannot be used.
+1  A: 

Faster than a hand-written unrolled loop? I doubt it.

The code could be made less repetitive by using a for-loop, but that wouldn't make it run any faster.
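For illustration, the for-loop version might look something like this (same output, just less typing; the name uninterlaceLoop is only illustrative):

// Loop version of the question's uninterlace: identical results, written once.
function uninterlaceLoop(n) {
  var out = [];
  for (var i = 0; i < 8; i++) {
    // item i combines bit (7 - i) and bit (15 - i)
    out.push(((n >> (7 - i)) & 1) | ((n >> (14 - i)) & 2));
  }
  return out;
}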

Daniel Earwicker
+1  A: 
function uninterlace(n) {
    var mask  = 0x0101; // 0b0000_0001_0000_0001
    var slide = 0x7f;   // 0b0111_1111
    return [(((n >> 0) & mask) + slide) >> 7,
            (((n >> 1) & mask) + slide) >> 7,
            (((n >> 2) & mask) + slide) >> 7,
            (((n >> 3) & mask) + slide) >> 7,
            (((n >> 4) & mask) + slide) >> 7,
            (((n >> 5) & mask) + slide) >> 7,
            (((n >> 6) & mask) + slide) >> 7,
            (((n >> 7) & mask) + slide) >> 7];
}

This is only four operations per entry, instead of 5. The trick is in reusing the shifted value. The addition of slide moves the relevant bits adjacent to each other, and the shift by 7 puts them in the low-order position. The use of + might be a weakness.
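To make that concrete, here is a trace of one entry with both of its source bits set (n = 0x0101 chosen purely for illustration):

// Entry for shift 0, with bits 0 and 8 of n set:
var n = 0x0101;
var masked = (n >> 0) & 0x0101; // 0x0101: bit 0 and bit 8 isolated
var slid   = masked + 0x7f;     // 0x0180: the +0x7f carries bit 0 up into bit 7
var entry  = slid >> 7;         // 0x0003: bit 7 -> position 0, bit 8 -> position 1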

A bigger weakness might be that each entry's operations must be done entirely in sequence, creating a latency of 4 instructions from entering a processor's pipeline to leaving it. These can be fully pipelined, but would still have some delay. The question's version exposes some instruction-level parallelism, and could potentially have a latency of only 3 instructions per entry, given sufficient execution resources.

It might be possible to combine multiple extractions into fewer operations, but I haven't seen a way to do it yet. The interlacing does, in fact, make that challenging.

Edit: a two-pass approach that treats the low- and high-order bits symmetrically (with interleaved 0s), shifts them off from each other by one, and ORs the results could be much faster, and would scale to longer bit strings.
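A rough sketch of one way that two-pass idea might look (spreadByte is just an illustrative helper name, and this is a sketch rather than a tuned implementation):

// Spread the 8 bits of a byte apart so a zero sits between each pair of bits.
function spreadByte(b) {
  b = (b | (b << 4)) & 0x0F0F;
  b = (b | (b << 2)) & 0x3333;
  b = (b | (b << 1)) & 0x5555;
  return b;
}

function uninterlace2(n) {
  // Low byte supplies result bit 0, high byte (shifted over by one) supplies bit 1.
  var m = spreadByte(n & 0xFF) | (spreadByte((n >> 8) & 0xFF) << 1);
  var out = [];
  for (var i = 0; i < 8; i++) {
    out.push((m >> (2 * (7 - i))) & 3); // read the eight 2-bit groups back out
  }
  return out;
}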

Edited to correct the slide per Pedro's comment. Sorry to take up your time on my poor base-conversion skills. It was originally 0xef, which puts the 0 bit in the wrong place.

Novelocrat
Sorry, it was a nice idea but your function gives me different results. For an input of 0x1234 (00010010-00110100) the result should be 0,0,1,3,0,1,2,0, but your function returns 1,3,1,1,3,1,1,1 (it's reversed, plus 1). The reversed issue is easy to solve, but I don't see how to avoid the +1.
Pedro Ladaria
I realized it was reversed after I posted, but was too lazy to edit, figuring that would be easy to resolve. I need to look more carefully at the +1 issue.
Novelocrat
If it helps: the data is fetched from a byte array and then joined into a 16-bit integer, so the input of the function can also be two 8-bit values.
Pedro Ladaria
Solved. The slide was wrong; it should be 0x7F (binary 01111111).
Pedro Ladaria
This is why languages need syntax for binary literals
Novelocrat
+1  A: 

Ok, now with 3 operations per item (tested and works).

This is a variation of Novelocrat's answer. It uses variable masks and slides.

function uninterlace(n) {
  // The mask picks one bit from each byte; the add carries the low bit up so it
  // sits just below the high bit; the shift drops the pair into bits 1..0.
  return [((n & 0x8080) + 0x3FFF) >> 14,
          ((n & 0x4040) + 0x1FFF) >> 13,
          ((n & 0x2020) + 0x0FFF) >> 12,
          ((n & 0x1010) + 0x07FF) >> 11,
          ((n & 0x0808) + 0x03FF) >> 10,
          ((n & 0x0404) + 0x01FF) >> 9,
          ((n & 0x0202) + 0x00FF) >> 8,
          ((n & 0x0101) + 0x007F) >> 7];
}
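For instance, the value discussed in the comments above checks out:

uninterlace(0x1234); // [0, 0, 1, 3, 0, 1, 2, 0], as expected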
Pedro Ladaria
Well done. I should have put a little more thought into my answer.
Novelocrat
A: 

How about a pair of small precalculated tables, 129 entries each?

// Two lookup tables, indexed directly by the masked bit (0, 1, 2, 4, ... 128),
// so 129 entries are needed. Any non-zero index means the low bit was set.
var b1 = [2], b0 = [0];
for (var i = 1; i <= 128; i++) { b1[i] = 3; b0[i] = 1; }

function uninterlace(n) {
  return [((n & 0x8000) ? b1 : b0)[n & 0x80],
          ((n & 0x4000) ? b1 : b0)[n & 0x40],
          ((n & 0x2000) ? b1 : b0)[n & 0x20],
          ((n & 0x1000) ? b1 : b0)[n & 0x10],
          ((n & 0x0800) ? b1 : b0)[n & 0x08],
          ((n & 0x0400) ? b1 : b0)[n & 0x04],
          ((n & 0x0200) ? b1 : b0)[n & 0x02],
          ((n & 0x0100) ? b1 : b0)[n & 0x01]];
}

This uses bit masking and table lookup instead of shifts and additions and might be faster.
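As a quick sanity check, the lookup version can be compared against the bit layout from the question over every 16-bit input (the inner expression simply restates that layout):

var ok = true;
for (var n = 0; n < 0x10000 && ok; n++) {
  var got = uninterlace(n);
  for (var i = 0; i < 8; i++) {
    var expected = ((n >> (7 - i)) & 1) | ((n >> (14 - i)) & 2);
    if (got[i] !== expected) ok = false;
  }
}
ok; // true: the table lookup matches the question's bit layout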

rsp