A start-step-stop code is a data compression technique used to compress numbers that are relatively small.

The code works as follows: It has three parameters, start, step and stop. Start determines the number of bits used to encode the first few numbers. Step determines how many bits to add to the encoding when we run out of codes, and stop determines the maximum number of bits used to encode a number.

So the number of data bits used by the i-th class of codes is given by l = start + step * i.

The "i" value of a particular code is encoded using unary. That is, a number of 1 bits followed by a terminating 0 bit. If we have reached stop then we can drop the terminating 0 bit. If i is zero we only write out the 0 bit.

So a (1, 2, 5) start-step-stop code would work as follows:

Value 0, encoded as: 0 0
Value 1, encoded as: 0 1
Value 2, encoded as: 10 000
Value 9, encoded as: 10 111
Value 10, encoded as: 11 00000
Value 41, encoded as: 11 11111
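
To make the scheme concrete, here is a minimal encoder sketch in Python (an illustration only; it assumes non-negative inputs that fit within the chosen parameters, and the name encode is hypothetical):

    def encode(value, start, step, stop):
        """Encode a non-negative integer as a start-step-stop bit string.

        For example, with (start, step, stop) = (1, 2, 5):
        encode(9, 1, 2, 5) -> '10111', encode(41, 1, 2, 5) -> '1111111'.
        """
        offset = 0      # smallest value representable in the current class
        length = start  # number of data bits in the current class
        i = 0           # class index, written out in unary
        while value - offset >= (1 << length):
            offset += 1 << length
            length += step
            i += 1
            if length > stop:
                raise ValueError("value too large for these parameters")
        # Unary prefix: i one-bits plus a terminating zero-bit, which is
        # dropped once the stop length has been reached.
        prefix = "1" * i + ("" if length == stop else "0")
        # Data bits: the value's offset within its class, zero-padded.
        return prefix + f"{value - offset:0{length}b}"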

So, given a file containing several numbers, how can we compute the optimal start-step-stop codes for that file? The optimal parameters are defined as those that will result in the greatest compression ratio.

A: 

The approach I used was a simple brute-force search (a sketch appears after this list). The algorithm followed these basic steps:

  1. Count the frequency of each number in the file. In the same pass, compute the total count of numbers in the file and record the largest number as maxNumber.

  2. Compute the probability of each number as its frequency divided by the total count of numbers in the file.

  3. Determine "optimalStop" as log2(maxNumber). This is the ideal number of bits needed to represent maxNumber according to Shannon information theory, and therefore a reasonable estimate of the optimal maximum number of bits used in the encoding of any particular number.

  4. For every "start" value from 1 to "optimalStop", repeat steps 5-7:

  5. For every "step" value from 1 to ("optimalStop" - "start") / 2, repeat steps 6 and 7:

  6. Calculate the "stop" value closest to "optimalStop" that satisfies stop = start + step * i for some integer i.

  7. Compute the average number of bits that would be used by this encoding: the sum, over all numbers, of each number's probability multiplied by its bit length in the given encoding.

  8. Pick the encoding with the lowest average number of bits.
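
A minimal sketch of this search in Python (assumptions: the input is already a list of non-negative integers, the names optimal_parameters and code_length are illustrative, and for step 6 I take the smallest stop at or above optimalStop so that every number stays encodable):

    import math
    from collections import Counter

    def code_length(value, start, step, stop):
        """Total bits to encode value: unary prefix plus data bits."""
        offset, length, i = 0, start, 0
        while value - offset >= (1 << length):
            offset += 1 << length
            length += step
            i += 1
            if length > stop:
                return None  # not encodable with these parameters
        # i unary one-bits, a terminating zero unless stop is reached,
        # then the data bits themselves.
        return i + (0 if length == stop else 1) + length

    def optimal_parameters(numbers):
        # Steps 1-2: frequencies, total count and probabilities.
        freq = Counter(numbers)
        total = len(numbers)
        max_number = max(numbers)
        # Step 3: optimalStop, rounded up so that maxNumber itself fits.
        optimal_stop = max(1, math.ceil(math.log2(max_number + 1)))
        best = None
        # Steps 4-7: try every (start, step), derive stop, score the code.
        for start in range(1, optimal_stop + 1):
            max_step = max((optimal_stop - start) // 2, 1)  # at least try step = 1
            for step in range(1, max_step + 1):
                # Step 6: smallest stop = start + step * i covering optimal_stop.
                i = max(0, math.ceil((optimal_stop - start) / step))
                stop = start + step * i
                # Step 7: average bits per number under this encoding.
                avg_bits = 0.0
                for value, count in freq.items():
                    bits = code_length(value, start, step, stop)
                    if bits is None:
                        break
                    avg_bits += (count / total) * bits
                else:
                    # Step 8: keep the encoding with the fewest average bits.
                    if best is None or avg_bits < best[0]:
                        best = (avg_bits, start, step, stop)
        return best  # (average bits, start, step, stop)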

fluffels
Stop may be higher than log2(maxNumber). You aren't accounting for prefix codes that could cause ambiguities. Also, even if your stop were correct, the runtime would be enormous. Any file with more than a few symbols to encode would take too long, making this approach unviable.
Ben S
Look again. Each code is prefixed by a unary encoding of its length. These have the property that they are not a prefix of any other unary encoding. Why do you think the runtime would be enormous? Please elaborate.
fluffels
Is there no example where a stop beyond log2(max) causes many numbers to fall into a lower class, saving many bits, while only a few big ones need extra bits? Also, using probability is odd when you have exact data. Besides using cumulative frequencies to look up the number of elements per class, I don't see or expect any improvement.
mweerden
If you multiply the probability of a number occurring by the number of bits it needs to be encoded, and you sum these over all occurring numbers, you get the average number of bits per symbol in the encoding. If you minimize this, you maximize compression.
fluffels
I thought about the case where the maximum stop needs to be higher. In the paper mentioned by Ben S above, the author iterates over possible start / step values and calculates stop. So I guess that would be an improvement.
fluffels
In my opinion you are misusing probability. An average is the sum of size * freq over all symbols, divided by the total sum of frequencies. Also, you want to minimize the size of the compressed file, which is the average per symbol times the number of symbols. Why not just take the sum of size * freq?
mweerden
(Size * Freq) / TotalFreq = Size * Prob, since Prob = Freq / TotalFreq. Or am I missing something?
fluffels
True, but I don't get why you use probabilities for this. To me it suggests that the probabilities have some significant relevance, while you are just using them as a shorthand. A shorthand to calculate something you are not even really looking for. Perhaps I'm being pedantic and you should just ignore me.
mweerden
Well... as far as I understand it, the lower the average bits per symbol for a certain combination of start, step and stop, the better the compression. So if you minimize this, you maximize compression.
fluffels
+3  A: 

These "start-step-stop" codes looks like a different way of calling Huffman codes. See the basic technique for an outline of the pseudo-code for calculating them.

Essentially this is what the algorithm does:

Before you start the Huffman encoding you need to gather the statistics of each symbol you'll be compressing (their total frequency in the file to compress).

Once you have that, you create a binary tree using that information, such that the most frequently used symbols are at the top of the tree (and thus use fewer bits) and such that no codeword is a prefix of another codeword, since a shared prefix would make decompression ambiguous.

At the end of the Huffman encoding, your start value will be the depth of the shallowest leaf node, your step will always be 1 (logically this makes sense: why force more bits than you need, just add one at a time), and your stop value will be the depth of the deepest leaf node.

If the frequency stats aren't sorted, this takes O(n log n); if they are sorted by frequency, it can be done in O(n).
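
For reference, here is a generic sketch of that construction in Python using a heap (an illustration, not code from this answer; it returns only the code lengths, from which start is the minimum and stop the maximum, as described above):

    import heapq
    from collections import Counter

    def huffman_code_lengths(symbols):
        """Map each symbol to its Huffman code length (tree depth)."""
        freq = Counter(symbols)
        if len(freq) == 1:  # degenerate case: a single distinct symbol
            return {next(iter(freq)): 1}
        # Heap entries: (weight, tie-breaker, {symbol: depth so far}).
        heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            # Merge the two least frequent subtrees; every symbol in
            # either subtree gains one bit of depth.
            w1, _, d1 = heapq.heappop(heap)
            w2, _, d2 = heapq.heappop(heap)
            merged = {s: d + 1 for s, d in d1.items()}
            merged.update({s: d + 1 for s, d in d2.items()})
            heapq.heappush(heap, (w1 + w2, tie, merged))
            tie += 1
        return heap[0][2]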

Huffman codes are guaranteed to have the best average compression for this type of encoding:

Huffman was able to design the most efficient compression method of this type: no other mapping of individual source symbols to unique strings of bits will produce a smaller average output size when the actual symbol frequencies agree with those used to create the code.

This should help you implement the ideal solution to your problem.

Edit: Though similar, this isn't what the OP was looking for.

This academic paper by the creator of these codes describes a generalization of start-step-stop codes called start-stop codes. The author also briefly describes how to obtain optimal start-step-stop parameters near the end of section 2: it involves either modelling the input with a statistical random variable or brute-forcing the best combination. Without any prior knowledge of the file, the algorithm is O((log n)^3).

Hope this helps.

Ben S
Huffman is exactly what I thought of when reading this question. +1.
strager
It reminded me of my sophomore year data structures class. :D
Ben S
Huffman codes are not prefixed by their length, as in a start, step, stop code. Therefore, your solution does not apply.
fluffels
Hrm, I see the difference. I hadn't noticed your paragraph describing the string of 1 bits leading to the codes. I've added a link to a paper from the author of these codes to help you out.
Ben S
I came across this paper, but could not find an English version for some reason. Thanks!
fluffels
To be honest, I see very little difference between the author's approach and mine, except that he loops over all possible start and step values and calculates a stop for each. He calculates the average code length using statistics, whereas I used brute force.
fluffels
Marking this as the accepted answer. Thanks for the link to that paper!
fluffels