tags:

views:

129

answers:

6

Dear ladies and sirs.

The standard string representation of GUID takes about 36 characters. Which is very nice, but also really wasteful. I am wondering, how to encode it in the shortest possible way using all the ASCII characters in the range 33-127. The naive implementation produces 22 characters, simply because 128 bits / 6 bits yields 22.

Huffman encoding is my second best, the only question is how to choose the codes....

Any more ideas?

Thanks.

P.S.

The encoding must be lossless, of course.

A: 

Simply go Base64.

leonbloy
A: 

Assuming that all of your GUIDs are being generated by the same algorithm, you can save 4 bits by not encoding the algorithm nibble, before applying any other encoding :-|

Damien_The_Unbeliever
+2  A: 

Using the full range from 33 (what's wrong wirh space, incidentally?) to 127 gives you 95 possible characters. Expressing the 2^128 possible values of guid in base 95 will use 20 characters. This (modulo things like dropping nybbles that will be constant) is the best you can do. Save yourself the trouble - use base 64.

AakashM
128 bits in base95 is 19.5 digits to be more accurate. If you consider the length free, you'll get that on average if omit most significant zeros.
Ants Aasma
A: 

An arbitrary GUID? The "naive" algorithm will produce optimal results. The only way to compress a GUID further is to make use of patterns in the data excluded by your "arbitrary" constraint.

jemfinch
Did you read the question? He's asking how to encode it more efficiently, not how to compress it.
Nick Johnson
Do you understand that encoding 128 bits more efficiently than 22 6-bit characters *requires* some form of compression and some pattern in the input?
jemfinch
+5  A: 

You have 95 characters available -- so, more than 6 bits, but not quite as many as 7 (about 6.57 actually). You could use 128/log2(95) = about 19.48 characters, to encode into 20 characters. If saving 2 characters in the encoded form is worth the loss of readability to you, something like (pseudocode):

char encoded[21];
long long guid;    // 128 bits number

for(int i=0; i<20; ++i) {
  encoded[i] = chr(guid % 95 + 33);
  guid /= 95;
}
encoded[20] = chr(0);

which is basically the generic "encode a number in some base" code, except that there's no need to reverse the "digits" since the order's arbitrary anyway (and little-endian is more direct and natural). To get back the guid from the encoded string is, in a very similar way, the polynomial computation in base 95 (after subtracting 33 from each digit of course):

guid = 0;

for(int i=0; i<20; ++i) {
  guid *= 95;
  guid += ord(encoded[i]) - 33;
}

essentially using Horner's approach to polynomial evaluation.

Alex Martelli
+5  A: 

Use Base 85. See section 4.1. Why 85? of A Compact Representation of IPv6 Addresses

An IPv6 address, like a GUID is made up of eight 16-bit pieces.

Paul Butcher
Looks good, but the characters '%' and '?' are not suitable if the encoding is to be used in database queries. We replace them with ':' and '.', because we do not care about IPv6 formatting issues.
mark