This is a question I ran into in school settings, but it keeps bugging me so I decided to ask it here.

In Huffman compression, fixed-length sequences (characters) are encoded with variable-length sequences. The code sequence length depends on the frequencies (or probabilities) of the source characters.

My question is: what is the minimum frequency of the most frequent character that guarantees it will be encoded with a single bit?

+5  A: 

In general, roughly 50% of the incoming symbol stream would have to consist of a given symbol for Huffman to code it as a single bit. The reason lies in how Huffman coding works (no symbol's encoding can be a prefix of another's): by encoding a symbol with a single bit, you require the first bit of every other symbol's code to be the opposite value (i.e. if one symbol is encoded as 0, everything else must start with 1 plus at least one more bit). Since a single-bit code eliminates half of the possible encoding space at every code length, you need to be encoding at least half of the input symbols with it in order to break even.

Note that there is a special case where the symbol space consists of only 3 symbols. In that case, whichever symbol has the greatest frequency will be encoded with 1 bit (since the other two will be the second-bit variations of whichever first-bit value isn't chosen); if two or more are tied for the greatest probability, any of them could get the single-bit code. Thus, in the 3-symbol case a symbol with, say, 34% probability could theoretically be coded as a single bit (say, 0) while the other two have 33% or lower probabilities and are coded as 10 and 11.

So if you're considering all possibilities, then technically anything with frequency 1/3 or above could potentially be encoded as a single bit (in the 3-symbol case).
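One way to check edge cases like this is to compute Huffman code lengths directly from the frequencies. The following is a minimal sketch (the function name `huffman_code_lengths` is mine, not something from the answer above):

```python
import heapq
import itertools

def huffman_code_lengths(freqs):
    """Return the Huffman code length for each symbol frequency in freqs."""
    tie = itertools.count()  # tie-breaker so equal frequencies compare cleanly
    # Each heap entry: (subtree frequency, tie-breaker, symbol indices in it)
    heap = [(f, next(tie), [i]) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    lengths = [0] * len(freqs)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        # Every symbol under a merged subtree moves one level deeper,
        # i.e. gains one bit.
        for i in a + b:
            lengths[i] += 1
        heapq.heappush(heap, (f1 + f2, next(tie), a + b))
    return lengths

# The 3-symbol case discussed above: 34% is enough for a single-bit code.
print(huffman_code_lengths([0.34, 0.33, 0.33]))  # → [1, 2, 2]
```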

Amber
One exception is 3 symbols of equal probability: they can be encoded as 0, 10, 11
Artelius
Yeah, I was already adding that in as a note. :) It's an edge case, but potentially relevant.
Amber
There are other, related cases (1/3, 1/6, 1/6, 1/6, 1/6) but it's not true for all cases where one symbol has probability of 1/3. I'd love to see an answer that shows what the distinction is.
Artelius
Thanks, Amber, but I am not sure your reasoning is correct. For example, for 4 characters with frequencies (0.41, 0.2, 0.2, 0.19), I believe the encoding will be (0, 10, 110, 111). This gives better compression than 4 two-bit characters: for 100 characters, 198 bits will be required instead of 200. However, I am still not sure how to generalize this idea.
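The arithmetic in this example is easy to verify directly from the stated code lengths, assuming the codes (0, 10, 110, 111) are as claimed:

```python
# Frequencies and code lengths from the example above.
freqs = [0.41, 0.20, 0.20, 0.19]
lengths = [1, 2, 3, 3]  # lengths of the codes 0, 10, 110, 111

# Expected bits for 100 characters = 100 * average code length.
bits = round(100 * sum(f * l for f, l in zip(freqs, lengths)))
print(bits)  # → 198, versus 200 bits for a fixed 2-bit code
```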
Daphna Shezaf
The reasoning is intended for general use cases, where the number of symbols is far greater than 2^2 and thus most symbols will be encoded with more than 2-3 bits. When the number of symbols is low, the "savings" from going from 2 bits to 1 bit for a character is relatively large even at lower frequencies, whereas for a higher total number of symbols, the savings is smaller unless the character occurs with a very high frequency. However, there is definitely a lower bound at 1/3 (except for the trivial case where the number of symbols is less than 3).
Amber
A: 

It turns out that the answer is 0.4: if the highest frequency p satisfies p >= 0.4, a 1-bit code for the corresponding character is guaranteed. In other words, this is a sufficient condition.

It is also true that p >= 1/3 is a necessary condition. That is, there are examples where 0.4 > p >= 1/3 and the shortest code is 1 bit, but there are no such cases when p < 1/3.

The way to reason about this is to look at how the code tree is constructed, in particular at the frequencies of the 3 last surviving subtrees. A proof appears in Johnsen, "On the redundancy of binary Huffman codes", 1980 (unfortunately behind a paywall).
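Both bounds can be illustrated by building the code tree and reading off code lengths. This is a sketch (the function name `huffman_lengths` and the example frequency sets are mine, chosen to exhibit each bound):

```python
import heapq
import itertools

def huffman_lengths(freqs):
    """Return the Huffman code length of each symbol, by repeated merging."""
    tie = itertools.count()  # tie-breaker for equal subtree frequencies
    heap = [(f, next(tie), [i]) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    out = [0] * len(freqs)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        for i in a + b:
            out[i] += 1  # one level deeper in the tree, one more bit
        heapq.heappush(heap, (f1 + f2, next(tie), a + b))
    return out

# p = 0.4 is sufficient: the most frequent symbol gets a 1-bit code.
print(huffman_lengths([0.4, 0.3, 0.3]))         # → [1, 2, 2]

# But 0.4 > p >= 1/3 does not guarantee it: here 0.35 gets a 2-bit code.
print(huffman_lengths([0.35, 0.25, 0.2, 0.2]))  # → [2, 2, 2, 2]
```

The second example shows why p >= 1/3 is only necessary, not sufficient: the two smallest frequencies merge into a 0.4 subtree, which then forces 0.35 to pair off with 0.25 one level down.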

Daphna Shezaf