views:

9

answers:

0

I'm trying to learn more about compression. Suppose we have the following string:

double, double toil and trouble; fire burn, and caldron bubble.

The notation I use below is similar to that used for formal grammars. Capital letters are nonterminals, lowercase letters are terminals, and S is the start symbol.

Example of encoding with a "flat dictionary":

S = D, D toil A trouB; fire burn, A caldron bubB.
A = and
B = ble
D = double

In "flat dictionary" encoding, only the start symbol's definition may reference nonterminals. Other definitions must simply be strings.

Example of encoding with a "tree dictionary":

S = D, D toil A trouB; fire burn, A caldron bubB.
A = and
B = ble
D = douB

In "tree dictionary" encoding, any symbol's definition may have nonterminals. This helps because when making a symbol for double, we can take advantage of B = ble and say douB. Care must be taken to ensure definitions do not recurse.

Example of encoding with a "macro dictionary":

S = D, D toil A B(trou); fire burn, A caldron bubB.
A = and
B(x) = xble
D = B(dou)

The "macro dictionary" formalism is an extension of "tree dictionary" that allows for nonterminals with parameters. This is similar to macros in C. Care must be taken to ensure that there is no infinite recursion.

Note that some (many?) compression schemes do not really fit into any of these categories. For instance, although LZW encoding uses a dictionary, that dictionary is not actually emitted. Instead, the decoder unravels the dictionary.

Are there more widely accepted terms for what I call "flat dictionary", "tree dictionary", and "macro dictionary" encoding schemes?