I have 64 bit values that I want to compress by exploiting the fact that only a portion somewhere in the middle contains data and before and after that are zeroes.

Say the actual data is l bits long and padded with n 0s in front and m 0s at the end such that n + l + m = 64. Instead of transmitting / storing 64 bits, I can transmit l bits plus whatever I need to encode the position of the data in the 64-bit interval.

For example, say I was storing l, m and the data bits, then I would restore the original 64-bit pattern by reading l, reading l bits of data, reading m and shifting the data m bits to the left.
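A minimal Python sketch of that round trip (the helper names are mine, not part of the question):

```python
def encode(x):
    """Split a 64-bit value into (l, m, data): l significant bits
    followed by m trailing zeros. Hypothetical helper for illustration."""
    if x == 0:
        return 0, 0, 0                   # all-zero value: no data bits
    m = (x & -x).bit_length() - 1        # count of trailing zeros
    data = x >> m                        # the l significant bits
    return data.bit_length(), m, data

def decode(l, m, data):
    """Restore the original 64-bit pattern by shifting back."""
    return data << m

x = 0x0000_00F0_0000_0000
l, m, data = encode(x)
assert (l, m, data) == (4, 36, 15)
assert decode(l, m, data) == x
```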

The smallest overhead I could come up with is two times 6 bits for storing any two of l, n and m (each can be between 0 and 64). Is it possible to reduce that number?

+3  A: 

As you have stated the problem, no, you cannot do better than the solution you have proposed.

However, if the distribution of the zeros in the numbers is skewed, you may be able to get better compression on average by using Huffman codes or a similar technique to represent the counts. Another possibility is to use delta coding if the zero distribution is strongly correlated from one 64-bit value to the next.

In either case, you will need a variable number of bits to represent the counts of zeros. And if your assumptions about skewness or correlation turn out to be false, you may end up using more bits on average than if you had done it the simple way.
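A quick sketch of the delta-coding idea for correlated zero counts (illustrative only, with made-up example values, not code from the answer):

```python
def leading_zeros_64(x):
    """Leading zeros of x when viewed as a 64-bit value."""
    return 64 - x.bit_length()

# Consecutive values with strongly correlated leading-zero counts:
values = [0x0000_FF00_0000_0000, 0x0000_FE00_0000_0000, 0x0000_FC00_0000_0000]
counts = [leading_zeros_64(v) for v in values]
# Send the first count verbatim, then only the (mostly small) differences,
# which a variable-length code can store in very few bits.
deltas = [counts[0]] + [b - a for a, b in zip(counts, counts[1:])]
assert deltas == [16, 0, 0]
```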

Stephen C
In general, the distribution of zeroes is not skewed, but the original values appear in large groups where the leading zeroes are the same and the trailing zeroes are very similar, so delta encoding would work well. From a quick glance at Hamming codes, I couldn't see how they shorten the codeword; they're error-checking codes and actually add (validation) information, don't they?
Hanno Fietz
Did you confuse Hamming and Huffman?
starblue
@starblue: Yes I did. Corrected now. Thanks
Stephen C
+1  A: 

Your solution seems pretty good.
Huffman coding is another way to compress your values, especially if some values occur much more frequently than others.

It's not very difficult to implement, but it might be overkill if you don't have much data to transmit.

Nick D
+4  A: 

Your analysis sounds right for single values. But if you're transmitting lots of such values together, a generic entropy encoding algorithm like gzip will probably do better, since it can eliminate the strings of zeroes quite well and also exploit redundancies in the data.
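For example, using Python's zlib (the same DEFLATE algorithm gzip uses) on a batch of sparse values; the exact ratio depends on the data, so this is just a sketch:

```python
import struct
import zlib

# Pack 1000 sparse 64-bit values; a generic compressor exploits the
# long runs of zero bytes and the repetition between values.
values = [0x0000_00F0_0000_0000] * 1000
raw = b"".join(struct.pack(">Q", v) for v in values)
packed = zlib.compress(raw, 9)
assert len(packed) < len(raw)                 # far smaller than 8000 bytes
assert zlib.decompress(packed) == raw         # lossless round trip
```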

Michael Borgwardt
+1 for gzip, though it's not immediately useful for me now, it's a good thing to keep in mind, I think.
Hanno Fietz
+2  A: 

l can be from 0 to 64, so don't send l; send n and m instead, since each can be zero and neither needs to reach 64 (they only need to be able to sum to 64).

The l bits must start and end with a 1, so they do not need to be transmitted.

send 6 bits for n
send up to 6 bits for m (see below)
calculate l = 64 - (n + m)
if l = 0, the number is 0, don't send anything else
if l = 1, the number is 1 * 2^m, don't send anything else
if l = 2, the number is 3 * 2^m, don't send anything else
send the middle l - 2 bits.

Maximum overhead = 10 bits.

The reduction in the bits for m is because
if n > 32 then you know m < 32, so only needs 5 bits
if n > 48 then you know m < 16, so only needs 4 bits
if n > 56 then you know m < 8, so only needs 3 bits
if n > 60 then you know m < 4, so only needs 2 bits
if n = 63 then you know m < 2, so only needs 1 bit
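The steps above can be sketched as a size calculation in Python (zero encoded as n=63, m=1 as suggested in the comments below; this is my reading of the scheme, not code from the answer):

```python
def encoded_size(x):
    """Total bits used for one 64-bit value under the scheme above."""
    if x == 0:
        n, l = 63, 0                      # zero sent as n=63, m=1
    else:
        n = 64 - x.bit_length()           # leading zeros
        m = (x & -x).bit_length() - 1     # trailing zeros
        l = 64 - n - m
    # variable width of the m field, deduced from n:
    if n == 63:  w = 1
    elif n > 60: w = 2
    elif n > 56: w = 3
    elif n > 48: w = 4
    elif n > 32: w = 5
    else:        w = 6
    # 6 bits for n, w bits for m, and the middle l - 2 bits (if any)
    return 6 + w + max(0, l - 2)

assert encoded_size(0) == 7               # 6 bits for n plus 1 bit for m
assert encoded_size(1) == 7               # n = 63, so m fits in 1 bit
assert encoded_size(0x0000_0000_0000_00F0) == 12   # n=56, m=4, l=4
```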

Stephen Denne
10 bits is overhead... meaning at most l + 10 bits will be needed.
Stephen Denne
Right, I can leave 2 bits off the payload, too, nice! And good point about deducing the bitlength of m from the value of n.
Hanno Fietz
If you decide to always encode zero as n=63, m=1, then all the ">" tests above can become ">=", adding a few more cases where one bit fewer needs to be sent.
Stephen Denne
A: 

There are 64 possible start positions n for the sequence of ones, and the length l of the sequence can be no longer than 64 - n. So there are

r = sum(n = 0..63, 64 - n) + 1

sequences in total. The added one is for the sequence of all zeros. Doing some math yields the following.

r = 64 * 64 - (63 * 64) / 2 + 1
  = 2081

Representing 2081 possible values requires log2(2081) = 11.023 bits. Your suggestion to encode the information using two 6 bit numbers requiring 12 bits in total is hence optimal (under the assumption of equal distributions of all possible values).
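The count can be checked directly (a quick verification of the sum above):

```python
from math import log2

# r = sum over start positions n of the possible run lengths, plus the
# all-zero word, matching the closed form in the answer.
r = sum(64 - n for n in range(64)) + 1
assert r == 64 * 64 - (63 * 64) // 2 + 1 == 2081
assert 11 < log2(r) < 12          # log2(2081) is about 11.02, so 12 bits suffice
```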

Daniel Brückner
I understood the question as meaning the data portion in the middle could contain 1s and 0s, not as meaning the data portion was contiguous 1s.
Stephen Denne
You are right - I got the middle part completely wrong. Thanks for pointing that out.
Daniel Brückner
This includes the sequence of length l = 0 starting at positions n = 0..63. You only need to encode one of those, so there are only 2018 values for which 11 bits are enough.
Henrik
(Besides the fact that my answer missed the question,) the sequence of all zeros is included only once. For example, take n = 0: there are 64 sequences with 1 to 64 bits set, so (64 - n) does not include a sequence of all zeros, but (65 - n) would. This goes up to n = 63, where there is only 1 sequence with 1 bit set. The sequence of all zeros is explicitly added by the term "+ 1" after the sum.
Daniel Brückner
Yes, you are right.
Henrik