I am writing a utility class which converts strings from one alphabet to another. This is useful in situations where you have a target alphabet you wish to use, with a restriction on the number of characters available. For example, if you can use lower case letters and numbers, but only a limited number of characters, it's possible to compress a timestamp from the alphabet `0123456789 -:` into `abcdefghijklmnopqrstuvwxyz0123456789`, so 2010-10-29 13:14:00 might become 5hhyo9v8mk6avy (19 characters reduced to 14).

The class is designed to convert back and forth between alphabets, and also calculate the longest source string that can safely be stored in a target alphabet given a particular number of characters.
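
For concreteness, that length calculation reduces to a change-of-base bound. A minimal sketch (the class and method names here are illustrative only, not the actual utility class):

```java
public class AlphabetMath {
    // Longest source string guaranteed to fit in maxTargetChars characters of
    // the target alphabet: we need sourceSize^len <= targetSize^maxTargetChars,
    // i.e. len <= maxTargetChars * log(targetSize) / log(sourceSize).
    static int maxSourceLength(int maxTargetChars, int sourceSize, int targetSize) {
        return (int) Math.floor(maxTargetChars * Math.log(targetSize)
                                               / Math.log(sourceSize));
    }

    public static void main(String[] args) {
        // Timestamp example: source alphabet "0123456789 -:" has 13 characters,
        // target a-z plus 0-9 has 36, and 14 target characters are available.
        System.out.println(maxSourceLength(14, 13, 36)); // prints 19
    }
}
```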

I was thinking of publishing this through Google Code; however, I'd obviously like other people to find it and use it - hence the question of what this is called. I've had to use this approach in two separate projects, with Bloomberg and a proprietary system, when you need to generate unique file names of a certain length but want to keep some plaintext, so GUIDs aren't appropriate.

A: 

You probably know about Base64, which does the same thing, just usually in the other direction. Too bad there are way too many Google results for BaseX or BaseN...

Tilka
That being said, I think the name of the algorithm is "coding translation" or something like that.
Tilka
I think it's a particular form of "entropy coding", don't ask me which though.
Tilka
+2  A: 

Your examples bear some similarity to a Dictionary coder with fixed source and target dictionaries. Also worth looking at is Fibonacci coding, which maps a fixed source dictionary to variable-length bit codes in its target.

I think it also depends on whether it is important that your target alphabet has fixed-width entries - if you allow a fixed alphabet with variable-length codes, your compression ratio will approach the source entropy much more closely. If the source symbol distribution is known in advance, a static Huffman tree can easily be generated.
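
To make the static Huffman suggestion concrete, here is a minimal sketch (all names are illustrative; it assumes the symbol frequencies are known up front):

```java
import java.util.*;

class Huffman {
    // A tree node: leaves carry a symbol, internal nodes only a combined frequency.
    static class Node implements Comparable<Node> {
        final long freq;
        final Character symbol; // null for internal nodes
        final Node left, right;
        Node(long freq, Character symbol, Node left, Node right) {
            this.freq = freq; this.symbol = symbol; this.left = left; this.right = right;
        }
        public int compareTo(Node o) { return Long.compare(freq, o.freq); }
    }

    // Build the tree bottom-up by repeatedly merging the two least-frequent
    // nodes, then read the variable-length bit codes off the tree.
    static Map<Character, String> buildCodes(Map<Character, Long> freqs) {
        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (Map.Entry<Character, Long> e : freqs.entrySet())
            pq.add(new Node(e.getValue(), e.getKey(), null, null));
        while (pq.size() > 1) {
            Node a = pq.poll(), b = pq.poll();
            pq.add(new Node(a.freq + b.freq, null, a, b));
        }
        Map<Character, String> codes = new HashMap<>();
        walk(pq.poll(), "", codes);
        return codes;
    }

    static void walk(Node n, String prefix, Map<Character, String> codes) {
        if (n == null) return;
        if (n.symbol != null) { codes.put(n.symbol, prefix.isEmpty() ? "0" : prefix); return; }
        walk(n.left, prefix + "0", codes);
        walk(n.right, prefix + "1", codes);
    }
}
```

Frequent symbols end up near the root and get short codes while rare symbols get long ones, which is how the average code length approaches the entropy.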

Nate
You forgot to mention arithmetic coding :) Actually, it would be a good solution when cardinality of the target alphabet is not a power of 2.
ruslik
Absolutely!! If you aren't worried about data corruption, arithmetic coding approaches the entropy even **better** than Huffman coding does. EDIT - although you wouldn't have a fixed target alphabet then. I got so excited there for a second, I forgot the context :)
Nate
Yeah, for small buffers transmitting the context will be overkill.
ruslik
The target alphabet typically doesn't have a fixed width, just a limit - the user needs to select their source alphabet in order to tune the limit as necessary. At least that's the way I envisage it being used.
Jon Freedman
+1  A: 

Here is a simple algorithm:

Consider that you don't have to transmit the alphabet used for encoding, and that, unlike in standard compression schemes, you don't use (or transmit) the probabilities of the input symbols; we are simply re-encoding the data.

In this case we can treat the input data as a number represented in a base equal to the cardinality of the input alphabet. We just have to change its representation to another base, which is a simple task.

EDITED example:

input alphabet: ABC, output alphabet: 0123456789

The message ABAC translates to 0102 in base 3, which is 11 (9 + 2) in base 10.

Written in the output alphabet (base 10), 11 is simply "11".

We could have a problem decoding it, because we don't know how many zeroes to place at the beginning of the decoded result, so we have to use one of these modifications:

1) somehow encode the size of the original data in the stream, or

2) use a dummy 1 at the start of the stream. With this modification our example becomes:

10102 (base 3) = 81 + 9 + 2 = 92 (base 10).

Now after decoding we just have to ignore the leading 1 (this also provides basic error detection).
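
A sketch of this dummy-1 scheme using arbitrary-precision integers (the class and method names are mine):

```java
import java.math.BigInteger;

class DummyOneCodec {
    // Read the message as base-srcBase digits, prefixed by a dummy 1
    // to preserve leading zero-digits, then re-write it in dstBase.
    static String encode(String msg, String srcAlpha, String dstAlpha) {
        BigInteger srcBase = BigInteger.valueOf(srcAlpha.length());
        BigInteger n = BigInteger.ONE; // the dummy leading 1
        for (char c : msg.toCharArray()) {
            int d = srcAlpha.indexOf(c);
            if (d < 0) throw new IllegalArgumentException("not in source alphabet: " + c);
            n = n.multiply(srcBase).add(BigInteger.valueOf(d));
        }
        BigInteger dstBase = BigInteger.valueOf(dstAlpha.length());
        StringBuilder out = new StringBuilder();
        while (n.signum() > 0) {
            BigInteger[] qr = n.divideAndRemainder(dstBase);
            out.append(dstAlpha.charAt(qr[1].intValue()));
            n = qr[0];
        }
        return out.reverse().toString();
    }

    // Reverse the process; the value left over at the end must be the
    // dummy 1, which doubles as a basic corruption check.
    static String decode(String enc, String srcAlpha, String dstAlpha) {
        BigInteger dstBase = BigInteger.valueOf(dstAlpha.length());
        BigInteger n = BigInteger.ZERO;
        for (char c : enc.toCharArray())
            n = n.multiply(dstBase).add(BigInteger.valueOf(dstAlpha.indexOf(c)));
        BigInteger srcBase = BigInteger.valueOf(srcAlpha.length());
        StringBuilder out = new StringBuilder();
        while (n.compareTo(BigInteger.ONE) > 0) {
            BigInteger[] qr = n.divideAndRemainder(srcBase);
            out.append(srcAlpha.charAt(qr[1].intValue()));
            n = qr[0];
        }
        if (!n.equals(BigInteger.ONE)) throw new IllegalArgumentException("corrupt input");
        return out.reverse().toString();
    }
}
```

With the example above, `encode("ABAC", "ABC", "0123456789")` returns `"92"`, and `decode("92", "ABC", "0123456789")` recovers `ABAC`.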

The main problem with this approach is that in most cases (whenever the two alphabet sizes are coprime, i.e. GCD == 1), each new input character can completely change the output so far. This is very inefficient and difficult to implement incrementally. We end up with arithmetic coding as the best solution (actually, a simplified version of it).

ruslik
You've hit the main problem I've encountered: how to differentiate `01` from `1` - my algorithm so far is based on long division, but it's not quite there yet.
Jon Freedman