views: 142
answers: 4

I know .NET supports base64 encoding of byte arrays. But I thought I could save even more space if I used a larger character set. I read somewhere that Unicode supports thousands of different characters, so why not use base1024 encoding, for example? And if this is possible, can you give some guidelines on how to implement it? Thanks

+12  A: 

Base64 exists for a purpose: to store/transfer binary data in a format that fits in 6 bits per character, to circumvent restrictions imposed by some protocols. If you don't have such a restriction, base64 is not for you. It was never designed to save space. If you need to save space and you are free to use anything, simply store the array as binary data.

Mehrdad Afshari
+1 Beat me to it....
jdv
It's for storing data in a URL. Base64 just makes the URL too long.
diamandiev
@diamandiev: Why would it matter? Shorter length doesn't necessarily imply smaller size.
Mehrdad Afshari
It's easier for the customer, and there are also limitations on URL length.
diamandiev
@diamandiev: You are probably relying on the URL to store too much data. If you are close to reaching the real world limits of a URL, you should consider storing data elsewhere. Besides, I can't see how it helps a customer to see a bunch of random unreadable Unicode characters (and characters that some fonts lack) along with some East Asian characters instead of base64.
Mehrdad Afshari
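To make the size trade-off above concrete: a small Python sketch (the question is about .NET, but the arithmetic is identical in any language) showing that base64 output is 4/3 the size of the input.

```python
import base64
import os

data = os.urandom(300)            # 300 bytes of arbitrary binary data
encoded = base64.b64encode(data)  # ASCII-safe representation
print(len(data), len(encoded))    # 300 -> 400: every 3 input bytes become 4 characters
```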
+3  A: 

The point of base64 is to avoid encoding issues. Practically all machines still running agree on the ASCII character set, although there are probably still a few EBCDIC machines out there consuming kilowatts. ASCII only encodes 95 printable, unambiguous characters. Base64 uses 64 of those, plus a padding character. Base128 is already too much.

There's nothing unambiguous about Unicode: common encodings in use are UTF-7, UTF-8, UTF-16, UTF-32, UCS-2, and their little-endian and big-endian varieties. Base1024 would require 1024 unambiguous characters, far too many for anybody to ever agree on. Note that it can't just be a contiguous encoded range, either: the Unicode charts have lots of holes in them, and those holes are scattered throughout.

Hans Passant
I wrote a simple driver for a plotter that spoke EBCDIC (actually about 8 different variants of it), around 2002 or so. I think it's likely there are still EBCDIC devices floating around that even their users don't know about.
Ken
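The "holes" in the Unicode charts mentioned above can be probed directly. A hedged sketch using Python's `unicodedata` module (just as a convenient window into the Unicode character database): code points without an entry sit right next to assigned characters.

```python
import unicodedata

def is_assigned(cp):
    """True if the code point has a name in Python's Unicode database."""
    try:
        unicodedata.name(chr(cp))
        return True
    except ValueError:
        return False

# U+0377 is GREEK SMALL LETTER PAMPHYLIAN DIGAMMA; U+0378 is an unassigned hole
print(is_assigned(0x0377), is_assigned(0x0378))  # True False
```

Any base1024 alphabet would have to route around such gaps, which is part of why no standard alphabet exists.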
A: 

As the others already mentioned, base64 doesn't save any space. It even blows up the number of characters needed to contain the same information (take a look at Wikipedia to see that three bytes need four characters for representation).

If you really need to save some space and want to compress a byte array, you should take a look at the LZMA algorithm. And if you need an implementation of this algorithm in C, C++, C#, or Java, take a look at the 7-Zip page.

Oliver
I believe .NET already has a built-in implementation of the LZMA algorithm; it's in the System.IO namespace and is called something along the lines of CompressedStream (can't remember exactly, but if you look through the various classes in the namespace it should stick out).
Grant Peters
@Grant Peters: I haven't worked with this stuff, so I never searched for such a thing within the .NET framework and just had in mind that 7-Zip has a C# implementation. But good to know for the future.
Oliver
On further investigation, it's actually in the 'System.IO.Compression' namespace, and currently both the 'gzip' and 'deflate' compression formats are available (both use the same compression algorithm, but gzip includes extra data for things like CRC checks).
Grant Peters
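The "compress first, then encode" approach discussed above can be sketched in a few lines. This uses Python's `zlib` module purely as a stand-in (it implements the same deflate algorithm that .NET's `DeflateStream`/`GZipStream` in `System.IO.Compression` expose); the payload here is a made-up example:

```python
import base64
import zlib

payload = b"highly repetitive data " * 50      # 1150 bytes, compresses well
compressed = zlib.compress(payload, level=9)   # deflate, the algorithm behind DeflateStream
token = base64.urlsafe_b64encode(compressed)   # URL-safe alphabet: no '/', '+', or padding issues in a URL path
print(len(payload), len(compressed), len(token))

# round-trip check
assert zlib.decompress(base64.urlsafe_b64decode(token)) == payload
```

For redundant data, the compressed-then-encoded token comes out far shorter than the raw payload, which is the gain the answer is pointing at; truly random data will not shrink.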
A: 

Depending on whether you use the 2-byte Unicode encoding (UCS-2) or a multi-byte one (UTF-8), base1024 would be either only slightly better than base64 or even more wasteful of space, since base64 uses 6 bits out of each 8-bit byte. Raw binary data converted to base64 becomes 4/3 larger (about 1.333x growth).

But base1024 using UCS-2 (16-bit) Unicode characters would use only 10 of the 16 bits, so it would take 8/5 the space: raw binary data converted to base1024 using UCS-2 would grow to 1.6 times its original size. This is worse than base64.

If you used UTF-8 Unicode instead, and were careful to use only Unicode characters that have 1- or 2-byte encodings, you could get at most 1920 more unique code points out of the 2-byte sequences, which works out to only a slight improvement in data density. (UTF-8 encoding only uses 6 bits of each additional 8-bit byte to carry the code point; the other 2 bits are used to indicate that more bytes follow.)

So this is not going to help. You should look into the possibility of compressing your data before converting it to base64.

John Knoeller
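The growth ratios in the answer above can be checked with a line of arithmetic each (assuming, as the answer does, 6 payload bits per 8-bit base64 character and 10 payload bits per 16-bit UCS-2 base1024 character):

```python
# growth factor = encoded bits per payload bit
base64_growth = 8 / 6    # 6 payload bits per 8-bit char -> 4/3, about 1.333x
base1024_ucs2 = 16 / 10  # 10 payload bits per 16-bit char -> 8/5 = 1.6x
print(base64_growth, base1024_ucs2)
```

So base1024 over UCS-2 loses to plain base64 by the ratio 1.6 / (4/3) = 1.2, a 20% penalty.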