(In .NET) I have arbitrary binary data stored in a byte[] (an image, for example). Now I need to store that data in a string (a "Comment" field of a legacy API). Is there a standard technique for packing this binary data into a string? By "packing" I mean that for any reasonably large and random data set, bytes.Length/2 is about the same as packed.Length, because two bytes are more or less a single character.
The two "obvious" answers don't meet all the criteria:
string base64 = System.Convert.ToBase64String(bytes);
doesn't make very efficient use of the string, since it only uses 64 characters out of the roughly 60,000 available (my storage is a System.String). Going with
string utf16 = System.Text.Encoding.Unicode.GetString(bytes);
makes better use of the string, but it won't work for data that contains invalid Unicode characters (say, mismatched surrogate pairs). This MSDN article shows this exact (poor) technique.
Let's look at a simple example:
byte[] bytes = new byte[] { 0x41, 0x00, 0x31, 0x00};
string utf16 = System.Text.Encoding.Unicode.GetString(bytes);
byte[] utf16_bytes = System.Text.Encoding.Unicode.GetBytes(utf16);
In this case *bytes* and *utf16_bytes* are the same, because the original bytes were a valid UTF-16 string. Doing the same procedure with base64 encoding gives a 16-member *base64_bytes* array.
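For comparison, here is a minimal sketch of the base64 variant of that procedure. The class and method names (`Base64SizeDemo`, `Utf16SizeOfBase64`) are made up for illustration; only `Convert.ToBase64String` and `Encoding.Unicode.GetBytes` are real framework calls.

```csharp
using System;
using System.Text;

public static class Base64SizeDemo
{
    // Hypothetical helper: how many bytes does the Base64 text occupy
    // once it is stored as a UTF-16 string?
    public static int Utf16SizeOfBase64(byte[] bytes)
    {
        // Base64 emits 4 characters for every 3 input bytes (padded with '=').
        string base64 = Convert.ToBase64String(bytes);

        // Each of those characters costs 2 bytes in a System.String.
        byte[] base64_bytes = Encoding.Unicode.GetBytes(base64);
        return base64_bytes.Length;
    }
}
```

For the four-byte example above, the base64 text is 8 characters long, so the stored form takes 16 bytes: four times the size of the original data.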
Now, repeat the procedure with invalid UTF-16 data:
byte[] bytes = new byte[] { 0x41, 0x00, 0x00, 0xD8};
You'll find that *utf16_bytes* does not match the original data.
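The two experiments can be wrapped in one small check. This is only a sketch: `Utf16RoundTripDemo` and `SurvivesRoundTrip` are hypothetical names, and it assumes the default behaviour of `Encoding.Unicode`, which substitutes U+FFFD for an invalid sequence instead of throwing.

```csharp
using System;
using System.Linq;
using System.Text;

public static class Utf16RoundTripDemo
{
    // Hypothetical helper: true if the bytes come back unchanged after
    // being reinterpreted as UTF-16 text and encoded again.
    public static bool SurvivesRoundTrip(byte[] bytes)
    {
        string utf16 = Encoding.Unicode.GetString(bytes);
        byte[] utf16_bytes = Encoding.Unicode.GetBytes(utf16);
        return bytes.SequenceEqual(utf16_bytes);
    }
}
```

Calling `SurvivesRoundTrip` with `{ 0x41, 0x00, 0x31, 0x00 }` should return true, while `{ 0x41, 0x00, 0x00, 0xD8 }` should return false, because the trailing 0x00 0xD8 is a lone high surrogate (U+D800) that the decoder replaces.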
I've written code that uses U+FFFD as an escape before invalid Unicode characters; it works, but I'd like to know if there is a more standard technique than something I just cooked up on my own. I also don't like catching DecoderFallbackException as the way of detecting invalid characters.
I guess you could call this a "base BMP" or "base UTF-16" encoding (using all the characters in the Unicode Basic Multilingual Plane). Yes, ideally I'd follow Shawn Steele's advice and pass around byte[].
I'm going to go with Peter Housel's suggestion as the "right" answer because he's the only one who came close to suggesting a "standard technique".
Edit: base16k looks even better.
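To make the idea concrete, here is a rough sketch of packing 14 bits into each character, in the spirit of base16k. It is not claimed to be that format's actual layout: the class name `Packed14`, the 0x4E00 offset, and the decimal length prefix are all assumptions of this sketch. Every payload character lands in U+4E00..U+8DFF, so no surrogates can appear.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class Packed14
{
    // Assumed offset: 14-bit values map to U+4E00..U+8DFF (no surrogates).
    private const int Offset = 0x4E00;

    public static string Encode(byte[] bytes)
    {
        var sb = new StringBuilder();
        // Decimal length prefix, so the decoder can drop the padding bits.
        sb.Append(bytes.Length.ToString(CultureInfo.InvariantCulture));

        int buffer = 0; // bit accumulator
        int bits = 0;   // number of valid bits currently in buffer
        foreach (byte b in bytes)
        {
            buffer = (buffer << 8) | b;
            bits += 8;
            if (bits >= 14)
            {
                bits -= 14;
                sb.Append((char)(Offset + ((buffer >> bits) & 0x3FFF)));
                buffer &= (1 << bits) - 1; // keep only the leftover bits
            }
        }
        if (bits > 0) // flush the remainder, zero-padded on the right
            sb.Append((char)(Offset + ((buffer << (14 - bits)) & 0x3FFF)));
        return sb.ToString();
    }

    public static byte[] Decode(string packed)
    {
        // Read the ASCII-digit length prefix (throws if it is missing).
        int i = 0;
        while (i < packed.Length && packed[i] >= '0' && packed[i] <= '9') i++;
        int length = int.Parse(packed.Substring(0, i), CultureInfo.InvariantCulture);

        var result = new byte[length];
        int buffer = 0, bits = 0, pos = 0;
        for (; i < packed.Length && pos < length; i++)
        {
            buffer = (buffer << 14) | ((packed[i] - Offset) & 0x3FFF);
            bits += 14;
            while (bits >= 8 && pos < length)
            {
                bits -= 8;
                result[pos++] = (byte)(buffer >> bits);
                buffer &= (1 << bits) - 1;
            }
        }
        return result;
    }
}
```

With this scheme packed.Length is roughly bytes.Length × 8/14 (about 0.57 characters per byte), which is close to the bytes.Length/2 target and well below the 4/3 characters per byte that base64 needs. The four-byte invalid UTF-16 example, `{ 0x41, 0x00, 0x00, 0xD8 }`, round-trips through `Encode` and `Decode` unchanged.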