(In .NET) I have arbitrary binary data stored in a byte[] (an image, for example). Now, I need to store that data in a string (a "Comment" field of a legacy API). Is there a standard technique for packing this binary data into a string? By "packing" I mean that for any reasonably large and random data set, bytes.Length/2 is about the same as packed.Length, because two bytes are more-or-less a single character.

The two "obvious" answers don't meet all the criteria:

string base64 = System.Convert.ToBase64String(bytes)

doesn't make very efficient use of the string, since it only uses 64 of the roughly 60,000 characters available (my storage is a System.String). Going with

string utf16 = System.Text.Encoding.Unicode.GetString(bytes)

makes better use of the string, but it won't work for data that contains invalid Unicode characters (say mis-matched surrogate pairs). This MSDN article shows this exact (poor) technique.

Let's look at a simple example:

byte[] bytes = new byte[] { 0x41, 0x00, 0x31, 0x00};
string utf16 = System.Text.Encoding.Unicode.GetString(bytes);
byte[] utf16_bytes = System.Text.Encoding.Unicode.GetBytes(utf16);

In this case bytes and *utf16_bytes* are the same, because the original bytes were a UTF-16 string. Doing the same procedure with base64 encoding gives a 16-member *base64_bytes* array.
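
For comparison, here is that base64 round trip spelled out (just the numbers above in code):

byte[] bytes = new byte[] { 0x41, 0x00, 0x31, 0x00 };
string base64 = System.Convert.ToBase64String(bytes);                  // "QQAxAA==", 8 characters
byte[] base64_bytes = System.Text.Encoding.Unicode.GetBytes(base64);   // 16 bytes of UTF-16 storage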

Now, repeat the procedure with invalid UTF-16 data:

byte[] bytes = new byte[] { 0x41, 0x00, 0x00, 0xD8};

You'll find that *utf16_bytes* do not match the original data.
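Spelled out (the exact fallback behaviour varies a little between framework versions, but the data is lost either way):

byte[] bytes = new byte[] { 0x41, 0x00, 0x00, 0xD8 };   // 'A' plus an unpaired high surrogate (U+D800)
string utf16 = System.Text.Encoding.Unicode.GetString(bytes);
byte[] utf16_bytes = System.Text.Encoding.Unicode.GetBytes(utf16);
// The lone surrogate is typically replaced with U+FFFD, so utf16_bytes come
// back as { 0x41, 0x00, 0xFD, 0xFF } rather than the original four bytes.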

I've written code that uses U+FFFD as an escape before invalid Unicode characters; it works, but I'd like to know if there is a more standard technique than something I just cooked up on my own. Not to mention, I don't like catching the DecoderFallbackException as the way of detecting invalid characters.
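For illustration, here is one possible shape of such an escaping scheme (a sketch, not my actual code; Escape/Unescape are just illustrative names). It treats the data as 16-bit units and prefixes U+FFFD to any unit that is a surrogate or one of U+FFFD/U+FFFE/U+FFFF, storing the escaped unit XOR 0x8000 so the stored character is always a valid BMP code point (never a surrogate or U+FFFD):

static string Escape(byte[] data)
{
    var sb = new System.Text.StringBuilder();
    for (int i = 0; i < data.Length; i += 2)
    {
        // A trailing odd byte is padded with zero; the real length would
        // have to be stored separately to undo that.
        ushort unit = (ushort) (data[i] | ((i + 1 < data.Length ? data[i + 1] : 0) << 8));

        if ((unit >= 0xD800 && unit <= 0xDFFF) || unit >= 0xFFFD)
        {
            // Surrogates and U+FFFD/U+FFFE/U+FFFF get escaped; the escaped
            // value lands in 0x5800-0x5FFF or 0x7FFD-0x7FFF, all valid chars.
            sb.Append('\uFFFD');
            sb.Append((char) (unit ^ 0x8000));
        }
        else
        {
            sb.Append((char) unit);
        }
    }
    return sb.ToString();
}

static byte[] Unescape(string s)
{
    var bytes = new System.Collections.Generic.List<byte>();
    for (int i = 0; i < s.Length; i++)
    {
        ushort unit = s[i];
        if (unit == 0xFFFD)
            unit = (ushort) (s[++i] ^ 0x8000);   // undo the escape

        bytes.Add((byte) (unit & 0xFF));
        bytes.Add((byte) (unit >> 8));
    }
    return bytes.ToArray();
}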

I guess you could call this a "base BMP" or "base UTF-16" encoding (using all the characters in the Unicode Basic Multilingual Plane). Yes, ideally I'd follow Shawn Steele's advice and pass around byte[].


I'm going to go with Peter Housel's suggestion as the "right" answer because he's the only one who came close to suggesting a "standard technique".


Edit: base16k looks even better.

+1  A: 

You could treat the binary data as UTF-8b. The UTF-8b encoding assumes that the bytes are UTF-8 multibyte sequences, but has a fallback encoding for things that are not.
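As a rough sketch of the round-tripping idea (a deliberately simplified toy, not real UTF-8b: actual UTF-8b only escapes bytes that fail to decode as valid UTF-8, whereas this version escapes every byte above 0x7F; PackBytes/UnpackBytes are just illustrative names):

static string PackBytes(byte[] bytes)
{
    // Bytes 0x00-0x7F map to themselves; everything else becomes a lone
    // low surrogate in the range U+DC80-U+DCFF, as UTF-8b does for
    // undecodable bytes.
    var chars = new char[bytes.Length];
    for (int i = 0; i < bytes.Length; i++)
        chars[i] = bytes[i] < 0x80 ? (char) bytes[i] : (char) (0xDC00 + bytes[i]);
    return new string(chars);
}

static byte[] UnpackBytes(string packed)
{
    // Reverse the mapping; the lone surrogates round-trip back to the
    // original bytes as long as nothing in between "repairs" the string.
    var bytes = new byte[packed.Length];
    for (int i = 0; i < packed.Length; i++)
        bytes[i] = packed[i] < 0x80 ? (byte) packed[i] : (byte) (packed[i] - 0xDC00);
    return bytes;
}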

Peter S. Housel
UTF-8b looks very interesting. But won't the unpaired low surrogates result in a malformed UTF-16 string?
Dan
Exactly. But most things won't care, and will just pass the invalid UTF-16 through.
Peter S. Housel
Let's say this "legacy API" reasonably assumes it is getting a real string, so who knows what it will do with it. If malformed UTF-16 strings are OK, I could just "cast" the original byte[] to a string. Of course, unpaired surrogates are likely to be considered a rather "minor" violation.
Dan
That predates the current (5.1.0) Unicode specification. The current specification is fairly firm about not accepting malformed UTF-8. It documents the valid ranges much more precisely than used to be the case. Also, the 5-byte and 6-byte sequences that could be devised are not used in Unicode.
Jonathan Leffler
UTF-8b isn't UTF-8. It's just a round-tripping codec for representing data that may or may not be UTF-8, through channels designed for UTF-16 or UCS-4... provided those channels don't look too closely at the data.
Peter S. Housel
That's why UTF-8b looks "pretty close", but it would be nice to have something that was a perfectly valid UTF-16 string. Is this problem really that obscure? Or have people just ignored the base64 inefficiency?
Dan
A: 

I fooled around with direct char arrays, and your one failing case works with my implementation. The code hasn't been tested thoroughly, so do your own tests first.

You could speed this up by using unsafe code. But I am sure UnicodeEncoding is just as slow (if not slower).

/// <summary>
/// Represents an encoding that packs bytes tightly into a string.
/// </summary>
public class ByteEncoding : Encoding
{
    /// <summary>
    /// Gets the Byte Encoding instance.
    /// </summary>
    public static readonly Encoding Encoding = new ByteEncoding();

    private ByteEncoding()
    {
    }

    public override int GetBytes(char[] chars, int charIndex, int charCount, byte[] bytes, int byteIndex)
    {
        for (int i = 0; i < charCount; i++)
        {
            // Work out some indices.
            int j = i * 2;
            int k = byteIndex + j;

            // Get the two bytes that make up this character.
            byte[] packedBytes = BitConverter.GetBytes((short) chars[charIndex + i]);

            // Unpack them into the output array.
            bytes[k] = packedBytes[0];
            bytes[k + 1] = packedBytes[1];
        }

        return charCount * 2;
    }

    public override int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex)
    {
        for (int i = 0; i < byteCount; i += 2)
        {
            // Work out some indices.
            int j = i / 2;
            int k = byteIndex + i;

            // Make sure we don't read too many bytes.
            byte byteB = 0;
            if (i + 1 < byteCount)
            {
                byteB = bytes[k + 1];
            }

            // Add it to the array.
            chars[charIndex + j] = (char) BitConverter.ToInt16(new byte[] { bytes[k], byteB }, 0);
        }

        return (byteCount / 2) + (byteCount % 2); // Round up.
    }

    public override int GetByteCount(char[] chars, int index, int count)
    {
        return count * 2;
    }

    public override int GetCharCount(byte[] bytes, int index, int count)
    {
        return (count / 2) + (count % 2);
    }

    public override int GetMaxByteCount(int charCount)
    {
        return charCount * 2;
    }

    public override int GetMaxCharCount(int byteCount)
    {
        return (byteCount / 2) + (byteCount % 2);
    }
}

Here is some test code:

    static void Main(string[] args)
    {
        byte[] original = new byte[256];

        // Note that we can't tell on the decode side how
        // long the array was if the original length is
        // an odd number. This will result in an
        // inconclusive result.
        for (int i = 0; i < original.Length; i++)
            original[i] = (byte) Math.Abs(i - 1);

        string packed = ByteEncoding.Encoding.GetString(original);
        byte[] unpacked = ByteEncoding.Encoding.GetBytes(packed);

        bool pass = true;

        if (original.Length != unpacked.Length)
        {
            Console.WriteLine("Inconclusive: Lengths differ.");
            pass = false;
        }

        int min = Math.Min(original.Length, unpacked.Length);
        for (int i = 0; i < min; i++)
        {
            if (original[i] != unpacked[i])
            {
                Console.WriteLine("Fail: Invalid at a position {0}.", i);
                pass = false;
            }
        }

        Console.WriteLine(pass ? "All Passed" : "Failure Present");

        Console.ReadLine();
    }

The test works, but you are going to have to test it with your API function.
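If the odd-length caveat noted in the test comment matters for your data, one possible workaround (a sketch, not part of ByteEncoding itself; Pack/Unpack are just illustrative names) is to prefix the payload with its length before packing:

static string Pack(byte[] data)
{
    // Prepend the payload length so the decoder can discard the padding
    // byte that an odd-length payload picks up.
    byte[] framed = new byte[data.Length + 4];
    BitConverter.GetBytes(data.Length).CopyTo(framed, 0);
    data.CopyTo(framed, 4);
    return ByteEncoding.Encoding.GetString(framed);
}

static byte[] Unpack(string packed)
{
    byte[] framed = ByteEncoding.Encoding.GetBytes(packed);
    int length = BitConverter.ToInt32(framed, 0);
    byte[] data = new byte[length];
    Array.Copy(framed, 4, data, 0, length);
    return data;
}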

Jonathan C Dickinson
A: 

There is another way to work around this limitation, although I am not sure how well it would work.

Firstly, you will need to figure out what type of string the API call is expecting, and what the structure of this string is. If I take a simple example, let's consider the .NET string:

  • Int32 _length;
  • byte[] _data;
  • byte _terminator = 0;

Add an overload to your API call, thus:

[DllImport("legacy.dll")]
private static extern void MyLegacyFunction(byte[] data);

[DllImport("legacy.dll")]
private static extern void MyLegacyFunction(string comment);

Then when you need to call the byte version you can do the following:

    public static void TheLegacyWisperer(byte[] data)
    {
        byte[] realData = new byte[data.Length + 4 /* _length */ + 1 /* _terminator */ ];
        byte[] lengthBytes = BitConverter.GetBytes(data.Length);
        Array.Copy(lengthBytes, realData, 4);
        Array.Copy(data, 0, realData, 4, data.Length);
        // realData[end] is equal to 0 in any case.
        MyLegacyFunction(realData);
    }
Jonathan C Dickinson
+7  A: 

May I suggest you do use base64? It may not be the most efficient way to do it storage-wise, but it does have its benefits:

  1. Your worries about the code are over.
  2. You'll have the least compatibility problems with other players, if there are any.
  3. Should the encoded string ever be considered as ASCII or ANSI during conversion, export, import, backup, restore, whatever, you won't have any problems either.
  4. Should you ever drop dead or end up under a bus or something, any programmer who ever gets their hands on the comment field will instantly know that it's base64 and not assume it's all encrypted or something.
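
For what it's worth, the whole round trip then stays within the framework's Convert methods:

byte[] bytes = new byte[] { 0x41, 0x00, 0x00, 0xD8 };       // any binary data, valid UTF-16 or not
string comment = System.Convert.ToBase64String(bytes);       // safe to store in the Comment field
byte[] restored = System.Convert.FromBase64String(comment);  // byte-for-byte identical to bytes
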
Dave Van den Eynde
Point #4 is exactly why I'm looking for "a more standard technique than something I just cooked up on my own". Using just 6 bits with base64 seems like a waste when I KNOW the string is UTF-16 and has nearly 16 bits available ("nearly" because of bit sequences that aren't valid characters).
Dan
True, it may sound like a waste, but so does any other file saved as UTF-16, because most of us don't use the whole character space anyway. Besides, if that legacy app is smart it would save as UTF-8 to a database or to a file, in which case using all of Unicode would have adverse effects.
Dave Van den Eynde
Perhaps the UTF-16 string always remains in memory; thus, it's (somewhat) important to use space efficiently. Memory is still relatively expensive compared to disk, especially with the more limited address space on 32-bit machines.
Dan
+3  A: 

Firstly, remember that Unicode doesn't mean 16 bits. The fact that System.String uses UTF-16 internally is neither here nor there. Unicode characters are abstract - they only gain bit representations through encodings.

You say "my storage is a System.String" - if that's the case, you cannot talk about bits and bytes, only Unicode characters. System.String certainly has it's own internal encoding, but (in theory) that could be different.

Incidentally, if you believe that the internal representation of System.String is too memory-inefficient for Base64-encoded data, why aren't you also worrying about Latin/Western strings?

If you want to store binary data in a System.String, you need a mapping between collections of bits and characters.

Option A: There's a pre-made one in the shape of Base64 encoding. As you've pointed out, this encodes six bits of data per character.

Option B: If you want to pack more bits per character, then you'll need to make an array (or encoding) of 128, 256, 512, etc. Unicode characters, and pack 7, 8, 9, etc. bits of data per character. Those characters need to be real Unicode characters.
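
For example, a minimal sketch of the 8-bits-per-character case, using the Braille Patterns block (U+2800-U+28FF, a contiguous run of 256 assigned characters) as an arbitrary alphabet; PackBase256/UnpackBase256 are just illustrative names:

static string PackBase256(byte[] data)
{
    // One real BMP character per byte: still two bytes of UTF-16 storage
    // per byte of data, but every output character is valid.
    var chars = new char[data.Length];
    for (int i = 0; i < data.Length; i++)
        chars[i] = (char) (0x2800 + data[i]);
    return new string(chars);
}

static byte[] UnpackBase256(string packed)
{
    var data = new byte[packed.Length];
    for (int i = 0; i < packed.Length; i++)
        data[i] = (byte) (packed[i] - 0x2800);
    return data;
}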

To answer your question simply, yes there is a standard, it's Base64-encoding.

Is this a real problem? Do you have perf data to support your idea of not using Base64?

stusmith
It's memory usage. In the sample above, base64 uses 4 times as much memory as UTF-16: 16 bytes vs 4 bytes. The legacy application I'm working with already feels memory pressure on 32-bit systems, so I'd like to find an "easy" and "standard" way to be more efficient.
Dan
+4  A: 

I stumbled onto Base16k after reading your question. Not strictly a standard but it seems to work well and was easy enough to implement in C#.

Bingo! This looks to be almost EXACTLY what I was looking for, depending on your definition of "standard".
Dan