Because of memory constraints, I have to store some pairs of values in an array using 6 bits per pair (3 bits per value). The problem comes when I want to access this array like a normal one, by the index of the pair. The array looks like this:

|--byte 0--|--byte 1--|--byte 2--|
| 00000011 | 11112222 | 22333333 |   ... and so on, the pattern repeats
  (each digit shows which pair the bit belongs to)

 => 4 pairs / 3 bytes
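(For clarity, here is a rough sketch of how a pair would be written into this layout. SetPair below is only an illustration of the packing, not code I actually use; it writes the bits MSB-first within each byte, as in the diagram.)

void SetPair(unsigned char *array, int index, int value1, int value2) {
    // Illustration only: stores one pair in the packed layout described above,
    // writing the 6 bits MSB-first, one bit at a time.
    unsigned pair = ((value1 & 7) << 3) | (value2 & 7); // value1 in the high 3 bits
    unsigned start = index * 6;                         // absolute bit offset of the pair
    for (int i = 0; i < 6; ++i) {
        unsigned bit = (pair >> (5 - i)) & 1;           // take bits from the MSB down
        unsigned pos = start + i;
        unsigned byteIdx = pos / 8, shift = 7 - pos % 8;
        array[byteIdx] = (array[byteIdx] & ~(1u << shift)) | (bit << shift);
    }
}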

You can see that for pairs whose index modulo 4 is 1 or 2, the 6 bits straddle a byte boundary, so two bytes are required to extract the values.
I wrote a function that, given a pair index, returns the first value of the pair (3 bits) and the second one (also 3 bits).

void GetPair(char *array, int index, int &value1, int &value2) {
    int groupIndex = index >> 2; // divide by 4 to get the index of the 3-byte group (holding 4 pairs)
    // We read 16 bits starting at the first byte of the group when the pair's position
    // within the group (index % 4) is 0 or 1, and starting at the second byte when it is 2 or 3.
    // Note: the read assumes the first byte is the most significant (big-endian),
    // matching the MSB-first packing shown in the diagram above.
    short int value = *(short int *)(array + groupIndex * 3 + ((index & 0x02) >> 1));

    switch(index & 0x03) { // index % 4
        case 0: { 
            // extract first 3 bits
            value1 = (value & 0xE000) >> 13;
            // extract the next 3 bits
            value2 = (value & 0x1C00) >> 10;
            break;
        }
        case 1: {
            value1 = (value & 0x380) >> 7;
            value2 = (value & 0x70) >> 4;
            break;
        }
        case 2: {
            value1 = (value & 0xE00) >> 9;
            value2 = (value & 0x1C0) >> 6;
            break;
        }
        case 3: {
            value1 = (value & 0x38) >> 3;
            value2 = value & 0x7;
            break;
        }
    }
}

Now my question is: Is there any faster method to extract these values?

I made a test: when using 2 bytes/pair (1 byte/value) it takes about 6 seconds to access all pairs (53 in total) 100 million times. When using the compact array it takes about 22 seconds :( (probably because of all those masks and bit shifts).
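The access pattern I am timing is roughly the following (just a sketch; the harness name and the repetition count are illustrative, not my exact test code):

long long BenchmarkGetPair(char *array, int pairCount, int repetitions) {
    int v1, v2;
    long long sum = 0;               // accumulate something so the calls aren't optimized away
    for (int r = 0; r < repetitions; ++r)
        for (int i = 0; i < pairCount; ++i) {
            GetPair(array, i, v1, v2);
            sum += v1 + v2;
        }
    return sum;
}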
I tried to explain as clearly as I could... forgive me if it isn't clear.

+1  A: 

Modern architectures don't even address single bytes anymore; they address 4-byte words and extract the piece you requested. So on stock hardware, you might see an improvement by using 4 bytes per pair and extracting the two values yourself. Four bytes per value might be faster still, but the cost of loading the second word is probably greater than the cost of masking and shifting. Or maybe not; modern processors are weird. Profile it and see!
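(Not from the answer, just to make the idea concrete: with 4 bytes per pair, the extraction could be as simple as the sketch below. The packing of the pair into the low 6 bits of each word is an arbitrary choice, and GetPairFromWords is a made-up name.)

// Sketch: one pair per aligned 32-bit word, value1 in bits 5..3, value2 in bits 2..0;
// the particular packing is arbitrary, the point is a single aligned load per pair.
void GetPairFromWords(const unsigned int *array, int index, int &value1, int &value2) {
    unsigned int word = array[index];
    value1 = (word >> 3) & 7;
    value2 = word & 7;
}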

David Seiler
It's more accurate to say that, while modern processors still use bytes as the fundamental unit for their address space, their wide data busses and cache lines mean that if you only ask for a single byte, the CPU will probably end up loading an entire word into cache anyways. Determining the optimal data size for something like this is highly dependent on the architecture, which is probably not x86 given the extreme memory constraints.
+2  A: 

This is a classic case of trading off speed for memory efficiency. I assume that you are working in an environment where memory is scarce and you need to shove many, many items into this array, otherwise this probably isn't worth your time.

You can eliminate the switch statement by using a lookup table to find the correct shift and mask values.

short int shift1[4] = { 13, 7, 9, 3 };
short int shift2[4] = { 10, 4, 6, 0 };
unsigned short int mask1[4] = { 0xe000, 0x0380, 0x0e00, 0x0038 }; /* unsigned so 0xe000 doesn't sign-extend */
unsigned short int mask2[4] = { 0x1c00, 0x0070, 0x01c0, 0x0007 };

int i = index % 4; /* you're not saving any time by using bitwise AND but you are making your code less readable */
value1 = (value & mask1[i]) >> shift1[i];
value2 = (value & mask2[i]) >> shift2[i];
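Dropped into the original function, the table-driven version might look like this (just a sketch; it keeps the question's big-endian 16-bit read and 3-byte group addressing):

void GetPair(char *array, int index, int &value1, int &value2) {
    static const short int shift1[4] = { 13, 7, 9, 3 };
    static const short int shift2[4] = { 10, 4, 6, 0 };
    static const unsigned short int mask1[4] = { 0xe000, 0x0380, 0x0e00, 0x0038 };
    static const unsigned short int mask2[4] = { 0x1c00, 0x0070, 0x01c0, 0x0007 };

    int i = index & 0x03;  // position of the pair within its 3-byte group
    unsigned short value =
        *(unsigned short *)(array + (index >> 2) * 3 + ((index & 0x02) >> 1));
    value1 = (value & mask1[i]) >> shift1[i];
    value2 = (value & mask2[i]) >> shift2[i];
}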

The idea is that you eliminate any branching. However each path is so short that it just may not matter. In my testing (gcc on a PowerPC) there was barely any difference. However, the memory bandwidth on this machine is slow enough that both versions are faster than just using direct array accesses and 1 byte per value.

Matt Kane
This saves about 4 seconds in my test! Thanks.
lgratian
+2  A: 

How about this? It eliminates memory accesses for the masks and shift values. (Of course, the (non-portable) assumption is that char is 8 bits and short is 16 bits. It is also assumed that index * 6 does not overflow int.)

void GetPair(char *array, int index, int &value1, int &value2)
{
   // The pair starts at bit index * 6; shifting right by (10 - bit offset within
   // the starting byte) lines the 6 bits up at the bottom of the 16-bit read.
   // Like the code in the question, this relies on a big-endian 16-bit load.
   unsigned shift = 10 - index * 6 % 8;
   unsigned short data = (*(unsigned short *)(array + index * 6 / 8) >> shift) & 0x3f;
   value2 = data & 7;   // low 3 bits of the pair
   value1 = data >> 3;  // high 3 bits of the pair
}

There might be a penalty for reading a short that crosses a 16-bit boundary (an unaligned read), though. There used to be such issues way back when I was still keeping track of such things. If that is the case, it would probably be better to read a 32-bit value starting at a 16-bit boundary and adjust the shifts and masks accordingly.
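Such a variant might look like the sketch below (same big-endian caveat as above, plus the assumption that unsigned int is 32 bits; note it can read a couple of bytes past the last pair, so the buffer needs a little slack at the end):

void GetPair32(char *array, int index, int &value1, int &value2)
{
   unsigned bit = index * 6;        // absolute bit offset of the pair
   unsigned shift = 26 - bit % 16;  // pair sits in bits [shift+5 .. shift] of the 32-bit read
   unsigned data = (*(unsigned int *)(array + bit / 16 * 2) >> shift) & 0x3f;
   value2 = data & 7;
   value1 = data >> 3;
}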

Ari
Mine: 25550 ms, Matt Kane: 23926 ms, Ari: 23114 ms
lgratian
Yeah, that's what I was really going for, but my brain wasn't quite working :)
Matt Kane
I would suggest testing the 32 bit (or even 64 bit - depending on your hardware) approach as well. My experience is that these things often give counterintuitive results owing to the complexities of the caching and pipelining systems.
Ari