ansaurus

Question

Answer 1

+1 A:

The tree is generally created from a frequency table of the bytes. So store that table, or just the bytes themselves sorted by frequency, and re-create the tree on the fly. This of course assumes that you're building the tree to represent single bytes, not larger blocks.

UPDATE: As pointed out by j_random_hacker in a comment, you actually can't do this: you need the frequency values themselves. They are combined and "bubble" upwards as you build the tree. This page describes the way a tree is built from the frequency table. As a bonus, it also saves this answer from being deleted by mentioning a way to save out the tree:

The easiest way to output the huffman tree itself is to, starting at the root, dump first the left hand side then the right hand side. For each node you output a 0, for each leaf you output a 1 followed by N bits representing the value.

unwind 2009-04-17 09:29:25

This would work, with only one downside and that is that it has a very bad worstcase scenario, if the string is AABCDEF the tree I am storing now is ABCDEF, basically I am back to square one, with the normal frequency found in the English language the size of the table I am storing would still cause the compression to be almost non-existent and in some cases even causing the data to be bigger than it was to begin with!

X-Istence 2009-04-17 09:32:52

Sure. One solution might be to just not compress input data shorter than 512 bytes or so. If this us used for disk-based data, that makes sense anyway, since most filesystems have overhead meaning that small files take up more actual disk space than they contain valid data for.

unwind 2009-04-17 09:39:07

I'm afraid you're going to have to realize that not all data is compressible. You're going to incur some overhead any way you do it, and if the data is small, or have a low entropy, the overhead might outweigh the gains of the compression algorithm. You should have a fallback encoding that doesn't compress for those scenarios.

Lasse V. Karlsen 2009-04-17 09:39:45

if the table is the one for the English language you don't need to store it for transmission. You just have to agree which one your implementation is going to use in that case.

Sam Hasler 2009-04-17 09:40:47

I *think* that you can't recreate a Huffman tree based on just the bytes sorted by frequency -- you need the actual frequencies, since different frequency vectors can lead to different Huffman trees even when the sorted order of the frequencies is the same. But I'm open to being proved wrong.

j_random_hacker 2009-04-17 09:41:16

That is correct, that's not enough, you need something more to get back the exact tree. See my post, which outputs extra bits to lay out the tree.

Lasse V. Karlsen 2009-04-17 09:44:37

@j_random_hacker: Thanks, I updated the answer.

unwind 2009-04-17 09:45:43

Answer 2

+2 A:

branches are 0 leaves are 1. Traverse the tree depth first to get its "shape"

e.g. the shape for this tree

0 - 0 - 1 (A)
|    \- 1 (E)
  \
    0 - 1 (C)
     \- 0 - 1 (B)
         \- 1 (D)

would be 001101011

Follow that with the bits for the characters in the same depth first order AECBD (when reading you'll know how many characters to expect from the shape of the tree). Then output the codes for the message. You then have a long series of bits that you can divide up into characters for output.

If you are chunking it, you could test that storing the tree for the next chuck is as efficient as just reusing the tree for the previous chunk and have the tree shape being "1" as an indicator to just reuse the tree from the previous chunk.

Sam Hasler 2009-04-17 09:32:05

Answer 3

+13 A:

Since you already have to implement code to handle a bit-wise layer on top of your byte-organized stream/file, here's my proposal.

Do not store the actual frequencies, they're not needed for decoding. You do, however, need the actual tree.

So for each node, starting at root:

If leaf-node: Output 1-bit + N-bit character/byte
If not leaf-node, output 0-bit. Then encode both child nodes (left first then right) the same way

To read, do this:

Read bit. If 1, then read N-bit character/byte, return new node around it with no children
If bit was 0, decode left and right child-nodes the same way, and return new node around them with those children, but no value

A leaf-node is basically any node that doesn't have children.

With this approach, you can calculate the exact size of your output before writing it, to figure out if the gains are enough to justify the effort. This assumes you have a dictionary of key/value pairs that contains the frequency of each character, where frequency is the actual number of occurances.

Pseudo-code for calculation:

Tree-size = 10 * NUMBER_OF_CHARACTERS - 1
Encoded-size = Sum(for each char,freq in table: freq * len(PATH(char)))

The tree-size calculation takes the leaf and non-leaf nodes into account, and there's one less inline node than there are characters.

SIZE_OF_ONE_CHARACTER would be number of bits, and those two would give you the number of bits total that my approach for the tree + the encoded data will occupy.

PATH(c) is a function/table that would yield the bit-path from root down to that character in the tree.

Here's a C#-looking pseudo-code to do it, which assumes one character is just a simple byte.

void EncodeNode(Node node, BitWriter writer)
{
    if (node.IsLeafNode)
    {
        writer.WriteBit(1);
        writer.WriteByte(node.Value);
    }
    else
    {
        writer.WriteBit(0);
        EncodeNode(node.LeftChild, writer);
        EncodeNode(node.Right, writer);
    }
}

To read it back in:

Node ReadNode(BitReader reader)
{
    if (reader.ReadBit() == 1)
    {
        return new Node(reader.ReadByte(), null, null);
    }
    else
    {
        Node leftChild = ReadNode(reader);
        Node rightChild = ReadNode(reader);
        return new Node(0, leftChild, rightChild);
    }
}

An example (simplified, use properties, etc.) Node implementation:

public class Node
{
    public Byte Value;
    public Node LeftChild;
    public Node RightChild;

    public Node(Byte value, Node leftChild, Node rightChild)
    {
        Value = value;
        LeftChild = leftChild;
        RightChild = rightChild;
    }

    public Boolean IsLeafNode
    {
        get
        {
            return LeftChild == null;
        }
    }
}

Here's a sample output from a specific example.

Input: AAAAAABCCCCCCDDEEEEE

Frequencies:

A: 6
B: 1
C: 6
D: 2
E: 5

Each character is just 8 bits, so the size of the tree will be 10 * 5 - 1 = 49 bits.

The tree could look like this:

      20
  ----------
  |        8
  |     -------
 12     |     3
-----   |   -----
A   C   E   B   D
6   6   5   1   2

So the paths to each character is as follows (0 is left, 1 is right):

A: 00
B: 110
C: 01
D: 111
E: 10

So to calculate the output size:

A: 6 occurances * 2 bits = 12 bits
B: 1 occurance * 3 bits = 3 bits
C: 6 occurances * 2 bits = 12 bits
D: 2 occurances * 3 bits = 6 bits
E: 5 occurances * 2 bits = 10 bits

Sum of encoded bytes is 12+3+12+6+10 = 43 bits

Add that to the 49 bits from the tree, and the output will be 92 bits, or 12 bytes. Compare that to the 20 * 8 bytes necessary to store the original 20 characters unencoded, you'll save 8 bytes.

The final output, including the tree to begin with, is as follows. Each character in the stream (A-E) is encoded as 8 bits, whereas 0 and 1 is just a single bit. The space in the stream is just to separate the tree from the encoded data and does not take up any space in the final output.

001A1C01E01B1D 0000000000001100101010101011111111010101010

For the concrete example you have in the comments, AABCDEF, you will get this:

Input: AABCDEF

Frequencies:

A: 2
B: 1
C: 1
D: 1
E: 1
F: 1

Tree:

        7
  -------------
  |           4
  |       ---------
  3       2       2
-----   -----   -----
A   B   C   D   E   F
2   1   1   1   1   1

Paths:

A: 00
B: 01
C: 100
D: 101
E: 110
F: 111

Tree: 001A1B001C1D01E1F = 59 bits
Data: 000001100101110111 = 18 bits
Sum: 59 + 18 = 77 bits = 10 bytes

Since the original was 7 characters of 8 bits = 56, you will have too much overhead of such small pieces of data.

Lasse V. Karlsen 2009-04-17 09:38:28

Note that the first calculation I added was incorrect for the size of the tree, have corrected it now and added a specific example to boot.

Lasse V. Karlsen 2009-04-17 10:16:50

Perhaps the best explanation I've seen. +1!

Stefan 2009-09-09 13:59:35

Answer 4

+1 A:

If you have enough control over the tree generation, you could make it do a canonical tree (the same way DEFLATE does, for example), which basically means you create rules to resolve any ambiguous situations when building the tree. Then, like DEFLATE, all you actually have to store are the lengths of the codes for each character.

That is, if you had the tree/codes Lasse mentioned above:

A: 00
B: 110
C: 01
D: 111
E: 10

Then you could store those as: 2, 3, 2, 3, 2

And that's actually enough information to regenerate the huffman table, assuming you're always using the same character set -- say, ASCII. (Which means you couldn't skip letters -- you'd have to list a code length for each one, even if it's zero.)

If you also put a limitation on the bit lengths (say, 7 bits), you could store each of these numbers using short binary strings. So 2,3,2,3,2 becomes 010 011 010 011 010 -- Which fits in 2 bytes.

If you want to get really crazy, you could do what DEFLATE does, and make another huffman table of the lengths of these codes, and store its code lengths beforehand. Especially since they add extra codes for "insert zero N times in a row" to shorten things further.

The RFC for DEFLATE isn't too bad, if you're already familiar with huffman coding: http://www.ietf.org/rfc/rfc1951.txt

Ezran 2009-06-02 22:04:28

ansaurus

tags:

views:

answers:

Efficient way of storing Huffman tree

related questions