views:

938

answers:

6

I've implemented file compression using huffman's algorithm, but the problem I have is that to enable decompression of the compressed file, the coding tree used, or the codes itself should be written to the file too. The question is: how do i do that? What is the best way to write the coding tree at the beggining of the compressed file?

+1  A: 

Instead of writing the code tree to the file, write how often each character was found, so the decompression program can generate the same tree.

ammoQ
+4  A: 

There's a pretty standard implementation of Huffman Coding in the Basic Compression Library (BCL), including a recursive function that writes the tree out to a file. Look at huffman.C. It just writes out the leaves in order so the decoder can reconstruct the same tree.

BCL is also nice because there are some other pretty straightforward compression algorithm pieces in there, too. It's quite handy if you need to roll your own algorithm.

tgamblin
+1  A: 

The most naive solution would be to parse the compression tree in pre-order and write the 256 values in the header of your file.

Blindy
+2  A: 

Assuming you compress on 8-bit symbols (i.e. bytes) and the algorithm is non-adaptive, the simplest way would be to store not the tree but the distribution of the values. For example by storing how often you found byte 0, how often byte 1, ..., how often byte 255. Then when reading back the file you can re-assemble the tree. This is the simplest solution, but requires the most storage space (e.g. to cover large files, you would need 4 bytes per value, i.e. 1kb).

You could optimize this by not storing exactly how often each byte was found in the file, but instead normalizing the values to 0..255 (0 = found least, ...), in which case you would only need to save 256 bytes. Re-assembling of the tree based on these values would result in the same tree. (This is not going to work as pointed out by Edmund and in question 759707 - see there for further links and answers to your question)

P.S.: And as Henk said, using seek() allows you to keep space at the beginning of the file to store the values in later.

dseifert
ISTR with Huffman, the frequencies do actually impact the tree you get -- the rankings aren't enough. E.g. on input where A occurs much more than any other letters, you might expect 0 = A, but if A and B both occur fairly frequently, you'd get 00 = A and 01 = B.
Edmund
A: 

Since every node in a huffman tree is either a branch with two children, or a leaf, you can use a single bit to represent each node unambiguously. For a leaf, follow immediately with the 8 bits for that node.

e.g. for this tree:

    /\
   /\ A
  B /\
   C  D

You could store 001[B]01[C]1[D]1[A]

(Turns out this is exactly what happens in the huffman.c example posted earlier, but not how it was described above).

JimG
A: 

Most implementations are using canonical huffman encoding. You have only to store the symbol lengths in a compact way. Hier an implementation: shcodec. Another way is using a semi-static huffman encoding (periodic rescale), then you have not to store any tree.

bill