I seek an algorithm that will let me represent an incoming sequence of bits as letters ('a' .. 'z'), in a minimal manner such that the stream of bits can be regenerated from the letters, without ever holding the entire sequence in memory.

That is, given an external bit source (each read returns a practically random bit), and user input of a number of bits, I would like to print out the minimal number of characters that can represent those bits.

Ideally there should be a parameterization: how much memory to use versus the maximum number of bits before some waste becomes necessary.

Efficiency Goal - The same number of characters as the base-26 representation of the bits.

Non-solutions:

  1. If sufficient storage were present, store the entire sequence and use a big-integer MOD 26 operation.

  2. Convert every 9 bits to 2 characters - this seems suboptimal, using only 512 of the 676 possible letter pairs and thus wasting about a quarter of the code space of the letters output.

+3  A: 

Could Huffman coding be what you're looking for? It's a compression algorithm, which pretty much represents any information with a minimum of wasted bits.

Smashery
Given that this would be a Huffman-decode scenario: what would the Huffman codes look like for 26 uniformly distributed values? I believe the variance in the number of bits consumed would result in a widely varying number of output characters, some far from the "base-26 representation" ideal.
They'd look like the table in Commodore Jaeger's solution. This is not a coincidence.
Steve Jessop
+1  A: 

Any solution you use is going to be space-inefficient because 26 is not a power of 2. As far as an algorithm goes, I'd rather use a lookup table than an on-the-fly calculation for each series of 9 bits. Your lookup table would be 512 entries long.
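A minimal sketch of the 512-entry lookup-table idea, taking each 9-bit group as an integer (the function and table names here are my own invention):

```python
import string

# One fixed letter pair for each of the 512 possible 9-bit values.
TABLE = [string.ascii_lowercase[v // 26] + string.ascii_lowercase[v % 26]
         for v in range(512)]
REVERSE = {pair: v for v, pair in enumerate(TABLE)}

def encode9(bits9):
    """Map a 9-bit value (int 0..511) to two letters."""
    return TABLE[bits9]

def decode9(pair):
    """Map a letter pair back to its 9-bit value."""
    return REVERSE[pair]
```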

Erik Forbes
It doesn't have to be that space-inefficient, mind.
Steve Jessop
True enough - though the concept of a lookup table stands, and it may be less efficient to use a smaller lookup table, even if that means the per-character efficiency is lower. Depends on the project requirements really.
Erik Forbes
+8  A: 

If you assign a different number of bits per letter, you should be able to exactly encode the bits in the twenty-six letters allowed without wasting any bits. (This is a lot like a Huffman code, only with a pre-built balanced tree.)

To encode bits into letters: Accumulate bits until you match exactly one of the bit codes in the lookup table. Output that letter, clear the bit buffer, and keep going.

To decode letters into bits: For each letter, output the bit sequence in the table.

Implementing in code is left as an exercise to the reader. (Or to me, if I get bored later.)

a 0000
b 0001
c 0010
d 0011
e 0100
f 0101
g 01100
h 01101
i 01110
j 01111
k 10000
l 10001
m 10010
n 10011
o 10100
p 10101
q 10110
r 10111
s 11000
t 11001
u 11010
v 11011
w 11100
x 11101
y 11110
z 11111
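A sketch of the scheme in Python, building the table above programmatically (the function names are mine, not part of the answer):

```python
# Build the variable-length code table: 'a'..'f' get the six 4-bit
# codes 0000..0101; 'g'..'z' get the twenty 5-bit codes 01100..11111.
CODES = {}
for i, ch in enumerate("abcdef"):
    CODES[ch] = format(i, "04b")
for i, ch in enumerate("ghijklmnopqrstuvwxyz"):
    CODES[ch] = format(i + 12, "05b")
DECODE = {code: ch for ch, code in CODES.items()}

def bits_to_letters(bits):
    """bits: a string of '0'/'1'. Returns (letters, leftover partial code)."""
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in DECODE:      # accumulated bits match exactly one code
            out.append(DECODE[buf])
            buf = ""
    return "".join(out), buf

def letters_to_bits(letters):
    """Inverse direction: concatenate each letter's bit code."""
    return "".join(CODES[ch] for ch in letters)
```

The match in the loop is unambiguous because the code is prefix-free: the 4-bit codes run 0000-0101, while every 5-bit code begins with 0110 or higher.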
Commodore Jaeger
How efficient is this? The way I look at it, a-f will be output 37.5% of the time, capturing only 4 bits. On average, won't you get only 4.625 bits of input per output character? My solution is a constant 4.7 bits per output character.
erickson
+6  A: 

Convert each block of 47 bits to a base 26 number of 10 digits. This gives you more than 99.99% efficiency.

This method, as well as others like Huffman, needs a padding mechanism to support variable-length input. This introduces some inefficiency which is less significant with longer inputs.

At the end of the bit stream, append an extra 1 bit. This must be done in all cases, even when the length of the bit stream is a multiple of 47. Any high-order letters of "zero" value can be skipped in the last block of encoded output.

When decoding the letters, a truncated final block can be filled out with "zero" letters and converted to a 47-bit base 2 representation. The final 1 bit is not data, but marks the end of the bit stream.
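The per-block conversion might be sketched as follows (function names are my own; terminator-bit handling across blocks is omitted). It relies on 2**47 < 26**10:

```python
import string

def block_to_letters(bits47):
    """Convert 47 bits (a '0'/'1' string) to 10 base-26 letters."""
    n = int(bits47, 2)
    letters = []
    for _ in range(10):
        n, r = divmod(n, 26)
        letters.append(string.ascii_lowercase[r])
    return "".join(reversed(letters))  # most significant letter first

def letters_to_block(letters10):
    """Inverse: 10 letters back to a 47-bit string."""
    n = 0
    for ch in letters10:
        n = n * 26 + (ord(ch) - ord("a"))
    return format(n, "047b")
```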

erickson
This seems an optimal solution for cases where the number of bits requested is an exact multiple of 47. For other lengths, wouldn't this produce up to 9 extra characters (compared to the ideal representation)?
I updated my answer to describe a way to handle the end of input.
erickson
+1  A: 

If you want the binary footprint of each letter to have the same size, the optimal solution would be given by arithmetic coding. However, it will not reach your goal of a mean representation of 4.5 bits/char. Given 26 different characters (not including space etc.), 4.7 would be the best you can reach without using variable-length encoding (Huffman, for instance; see Jaeger's answer) or other compression algorithms.

A suboptimal, although simpler, solution could be to find a feasible number of characters to fit into a big integer. For instance, if you form a 32-bit integer out of every 6-character chunk (which is possible as 26^6 < 2^32), you use 5.33 bits/char. You can actually even fit 13 letters into a 64-bit integer (4.92 bits/char). This is quite close to the optimal solution, and still rather easy to implement. Using integers bigger than 64 bits can be tricky due to missing native support in many programming languages.

If you want even better compression rates for text, you should definitely also look into dictionary-based compression algorithms, such as LZW or Deflate.

norheim.se
The goal is in representing the high-entropy bit stream. With the 5.33 or 4.92 solutions, there would be 32-bit patterns which aren't representable? Does arithmetic encoding guarantee that a random bit sequence is represented uniquely in the output?
The "non representable patterns" are simply the bit waste. Arithmetic encoding is, I believe, optimal in the sense that it has minimum such waste. Any bit pattern in the encoded stream is decodable, and unless you are very close to the end of the stream, it should be uniquely represented.
norheim.se
I think this solution is backwards. He wants to encode a stream of bits as a stream of letters, not letters as bits.
Jamie
+3  A: 

Zero waste would be log_2(26) bits per letter. As pointed out earlier, you can get to 4.7 by reading 47 bits and converting them to 10 letters. However, you can get to 4.67 by converting every 14 bits into 3 characters. This has the advantage that it fits into an integer. If you have storage space and run time is important, you can create a lookup table with 16,384 entries mapping each possible 14-bit value to 3 letters. Otherwise, you can do mod and div operations to compute the 3 letters.

number of letters    number of bits    bits/letter
 1                    4                4
 2                    9                4.5
 3                   14                4.67
 4                   18                4.5
 5                   23                4.6
 6                   28                4.67
 7                   32                4.57
 8                   37                4.63
 9                   42                4.67
10                   47                4.7
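The div/mod variant of the 14-bit scheme might look like this (function names are mine); it relies on 2**14 < 26**3:

```python
def bits14_to_letters(n):
    """Map a 14-bit value (int 0..16383) to 3 letters."""
    a, rest = divmod(n, 26 * 26)
    b, c = divmod(rest, 26)
    return "".join(chr(ord("a") + d) for d in (a, b, c))

def letters_to_bits14(letters3):
    """Inverse: 3 letters back to the 14-bit value."""
    n = 0
    for ch in letters3:
        n = n * 26 + (ord(ch) - ord("a"))
    return n
```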
David Nehme