(In .NET) I have arbitrary binary data stored in a byte[] (an image, for example). Now, I need to store that data in a string (a "Comment" field of a legacy API). Is there a standard technique for packing this binary data into a string? By "packing" I mean that for any reasonably large and random data set, bytes.Length/2 is about the same as packed.Length, because two bytes are more-or-less a single character.

The two "obvious" answers don't meet all the criteria:

string base64 = System.Convert.ToBase64String(bytes)

doesn't make very efficient use of the string, since it only uses 64 of the roughly 60,000 characters available (my storage is a System.String). Going with

string utf16 = System.Text.Encoding.Unicode.GetString(bytes)

makes better use of the string, but it won't work for data that contains invalid Unicode characters (say mis-matched surrogate pairs). This MSDN article shows this exact (poor) technique.

Let's look at a simple example:

byte[] bytes = new byte[] { 0x41, 0x00, 0x31, 0x00};
string utf16 = System.Text.Encoding.Unicode.GetString(bytes);
byte[] utf16_bytes = System.Text.Encoding.Unicode.GetBytes(utf16);

In this case bytes and *utf16_bytes* are the same, because the original bytes were a UTF-16 string. Doing the same procedure with base64 encoding gives a 16-member *base64_bytes* array.
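
For comparison, here is that base64 round trip spelled out (just the numbers above in code):

byte[] bytes = new byte[] { 0x41, 0x00, 0x31, 0x00 };
string base64 = System.Convert.ToBase64String(bytes);                  // "QQAxAA==", 8 characters
byte[] base64_bytes = System.Text.Encoding.Unicode.GetBytes(base64);   // 16 bytes of UTF-16 storage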

Now, repeat the procedure with invalid UTF-16 data:

byte[] bytes = new byte[] { 0x41, 0x00, 0x00, 0xD8};

You'll find that *utf16_bytes* do not match the original data.
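Spelled out (the exact fallback behaviour varies a little between framework versions, but the data is lost either way):

byte[] bytes = new byte[] { 0x41, 0x00, 0x00, 0xD8 };   // 'A' plus an unpaired high surrogate (U+D800)
string utf16 = System.Text.Encoding.Unicode.GetString(bytes);
byte[] utf16_bytes = System.Text.Encoding.Unicode.GetBytes(utf16);
// The lone surrogate is typically replaced with U+FFFD, so utf16_bytes come
// back as { 0x41, 0x00, 0xFD, 0xFF } rather than the original four bytes.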

I've written code that uses U+FFFD as an escape before invalid Unicode characters; it works, but I'd like to know if there is a more standard technique than something I just cooked up on my own. Not to mention, I don't like catching the DecoderFallbackException as the way of detecting invalid characters.
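For illustration, here is one possible shape of such an escaping scheme (a sketch, not my actual code; Escape/Unescape are just illustrative names). It treats the data as 16-bit units and prefixes U+FFFD to any unit that is a surrogate or one of U+FFFD/U+FFFE/U+FFFF, storing the escaped unit XOR 0x8000 so the stored character is always a valid BMP code point (never a surrogate or U+FFFD):

static string Escape(byte[] data)
{
    var sb = new System.Text.StringBuilder();
    for (int i = 0; i < data.Length; i += 2)
    {
        // A trailing odd byte is padded with zero; the real length would
        // have to be stored separately to undo that.
        ushort unit = (ushort) (data[i] | ((i + 1 < data.Length ? data[i + 1] : 0) << 8));

        if ((unit >= 0xD800 && unit <= 0xDFFF) || unit >= 0xFFFD)
        {
            // Surrogates and U+FFFD/U+FFFE/U+FFFF get escaped; the escaped
            // value lands in 0x5800-0x5FFF or 0x7FFD-0x7FFF, all valid chars.
            sb.Append('\uFFFD');
            sb.Append((char) (unit ^ 0x8000));
        }
        else
        {
            sb.Append((char) unit);
        }
    }
    return sb.ToString();
}

static byte[] Unescape(string s)
{
    var bytes = new System.Collections.Generic.List<byte>();
    for (int i = 0; i < s.Length; i++)
    {
        ushort unit = s[i];
        if (unit == 0xFFFD)
            unit = (ushort) (s[++i] ^ 0x8000);   // undo the escape

        bytes.Add((byte) (unit & 0xFF));
        bytes.Add((byte) (unit >> 8));
    }
    return bytes.ToArray();
}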

I guess you could call this a "base BMP" or "base UTF-16" encoding (using all the characters in the Unicode Basic Multilingual Plane). Yes, ideally I'd follow Shawn Steele's advice and pass around byte[].


I'm going to go with Peter Housel's suggestion as the "right" answer because he's the only one who came close to suggesting a "standard technique".


Edit: base16k looks even better.

+1  A: 

You could treat the binary data as UTF-8b. The UTF-8b encoding assumes that the bytes are UTF-8 multibyte sequences, but has a fallback encoding for things that are not.
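As a rough sketch of the round-tripping idea (a deliberately simplified toy, not real UTF-8b: actual UTF-8b only escapes bytes that fail to decode as valid UTF-8, whereas this version escapes every byte above 0x7F; PackBytes/UnpackBytes are just illustrative names):

static string PackBytes(byte[] bytes)
{
    // Bytes 0x00-0x7F map to themselves; everything else becomes a lone
    // low surrogate in the range U+DC80-U+DCFF, as UTF-8b does for
    // undecodable bytes.
    var chars = new char[bytes.Length];
    for (int i = 0; i < bytes.Length; i++)
        chars[i] = bytes[i] < 0x80 ? (char) bytes[i] : (char) (0xDC00 + bytes[i]);
    return new string(chars);
}

static byte[] UnpackBytes(string packed)
{
    // Reverse the mapping; the lone surrogates round-trip back to the
    // original bytes as long as nothing in between "repairs" the string.
    var bytes = new byte[packed.Length];
    for (int i = 0; i < packed.Length; i++)
        bytes[i] = packed[i] < 0x80 ? (byte) packed[i] : (byte) (packed[i] - 0xDC00);
    return bytes;
}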

Peter S. Housel
UTF-8b looks very interesting. But won't the unpaired low surrogates result in a malformed UTF-16 string?
Dan
Exactly. But most things won't care, and will just pass the invalid UTF-16 through.
Peter S. Housel
Let's say this "legacy API" reasonably assumes it is getting a real string, so who knows what it will do with it. If malformed UTF-16 strings are OK, I could just "cast" the original byte[] to a string. Of course, unpaired surrogates are likely to be considered a rather "minor" violation.
Dan
That predates the current (5.1.0) Unicode specification. The current specification is fairly firm about not accepting malformed UTF-8. It documents the valid ranges much more precisely than used to be the case. Also, the 5-byte and 6-byte sequences that could be devised are not used in Unicode.
Jonathan Leffler
UTF-8b isn't UTF-8. It's just a round-tripping codec for representing data that may or may not be UTF-8, through channels designed for UTF-16 or UCS-4... provided those channels don't look too closely at the data.
Peter S. Housel
That's why UTF-8b looks "pretty close", but it would be nice to have something that was a perfectly valid UTF-16 string. Is this problem really that obscure? Or have people just ignored the base64 inefficiency?
Dan
A: 

I fooled around with direct char arrays, and your one failing case works with my implementation. The code hasn't been tested thoroughly, so do your own tests first.

You could speed this up by using unsafe code. But I am sure UnicodeEncoding is just as slow (if not slower).

/// <summary>
/// Represents an encoding that packs bytes tightly into a string.
/// </summary>
public class ByteEncoding : Encoding
{
    /// <summary>
    /// Gets the Byte Encoding instance.
    /// </summary>
    public static readonly Encoding Encoding = new ByteEncoding();

    private ByteEncoding()
    {
    }

    public override int GetBytes(char[] chars, int charIndex, int charCount, byte[] bytes, int byteIndex)
    {
        for (int i = 0; i < charCount; i++)
        {
            // Work out some indices.
            int j = i * 2;
            int k = byteIndex + j;

            // Get the two bytes that make up this character.
            byte[] packedBytes = BitConverter.GetBytes((short) chars[charIndex + i]);

            // Unpack them into the output array.
            bytes[k] = packedBytes[0];
            bytes[k + 1] = packedBytes[1];
        }

        return charCount * 2;
    }

    public override int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex)
    {
        for (int i = 0; i < byteCount; i += 2)
        {
            // Work out some indices.
            int j = i / 2;
            int k = byteIndex + i;

            // Make sure we don't read too many bytes.
            byte byteB = 0;
            if (i + 1 < byteCount)
            {
                byteB = bytes[k + 1];
            }

            // Add it to the array.
            chars[charIndex + j] = (char) BitConverter.ToInt16(new byte[] { bytes[k], byteB }, 0);
        }

        return (byteCount / 2) + (byteCount % 2); // Round up.
    }

    public override int GetByteCount(char[] chars, int index, int count)
    {
        return count * 2;
    }

    public override int GetCharCount(byte[] bytes, int index, int count)
    {
        return (count / 2) + (count % 2);
    }

    public override int GetMaxByteCount(int charCount)
    {
        return charCount * 2;
    }

    public override int GetMaxCharCount(int byteCount)
    {
        return (byteCount / 2) + (byteCount % 2);
    }
}

Here is some test code:

    static void Main(string[] args)
    {
        byte[] original = new byte[256];

        // Note that we can't tell on the decode side how
        // long the array was if the original length is
        // an odd number. This will result in an
        // inconclusive result.
        for (int i = 0; i < original.Length; i++)
            original[i] = (byte) Math.Abs(i - 1);

        string packed = ByteEncoding.Encoding.GetString(original);
        byte[] unpacked = ByteEncoding.Encoding.GetBytes(packed);

        bool pass = true;

        if (original.Length != unpacked.Length)
        {
            Console.WriteLine("Inconclusive: Lengths differ.");
            pass = false;
        }

        int min = Math.Min(original.Length, unpacked.Length);
        for (int i = 0; i < min; i++)
        {
            if (original[i] != unpacked[i])
            {
                Console.WriteLine("Fail: Invalid at a position {0}.", i);
                pass = false;
            }
        }

        Console.WriteLine(pass ? "All Passed" : "Failure Present");

        Console.ReadLine();
    }

The test works, but you are going to have to test it with your API function.
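If the odd-length caveat noted in the test comment matters for your data, one possible workaround (a sketch, not part of ByteEncoding itself; Pack/Unpack are just illustrative names) is to prefix the payload with its length before packing:

static string Pack(byte[] data)
{
    // Prepend the payload length so the decoder can discard the padding
    // byte that an odd-length payload picks up.
    byte[] framed = new byte[data.Length + 4];
    BitConverter.GetBytes(data.Length).CopyTo(framed, 0);
    data.CopyTo(framed, 4);
    return ByteEncoding.Encoding.GetString(framed);
}

static byte[] Unpack(string packed)
{
    byte[] framed = ByteEncoding.Encoding.GetBytes(packed);
    int length = BitConverter.ToInt32(framed, 0);
    byte[] data = new byte[length];
    Array.Copy(framed, 4, data, 0, length);
    return data;
}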

Jonathan C Dickinson
A: 

There is another way to work around this limitation, although I am not sure how well it would work.

Firstly, you will need to figure out what type of string the API call is expecting, and what the structure of this string is. If I take a simple example, let's consider the .NET string:

  • Int32 _length;
  • byte[] _data;
  • byte _terminator = 0;

Add an overload to your API call, thus:

[DllImport("legacy.dll")]
private static extern void MyLegacyFunction(byte[] data);

[DllImport("legacy.dll")]
private static extern void MyLegacyFunction(string comment);

Then when you need to call the byte version you can do the following:

    public static void TheLegacyWisperer(byte[] data)
    {
        byte[] realData = new byte[data.Length + 4 /* _length */ + 1 /* _terminator */ ];
        byte[] lengthBytes = BitConverter.GetBytes(data.Length);
        Array.Copy(lengthBytes, realData, 4);
        Array.Copy(data, 0, realData, 4, data.Length);
        // realData[end] is equal to 0 in any case.
        MyLegacyFunction(realData);
    }
Jonathan C Dickinson
+7  A: 

May I suggest you do use base64? It may not be the most efficient way to do it storage-wise, but it does have its benefits:

  1. Your worries about the code are over.
  2. You'll have the least compatibility problems with other players, if there are any.
  3. Should the encoded string ever be considered as ASCII or ANSI during conversion, export, import, backup, restore, whatever, you won't have any problems either.
  4. Should you ever drop dead or end up under a bus or something, any programmer who ever gets their hands on the comment field will instantly know that it's base64 and not assume it's all encrypted or something.
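
For what it's worth, the whole round trip then stays within the framework's Convert methods:

byte[] bytes = new byte[] { 0x41, 0x00, 0x00, 0xD8 };       // any binary data, valid UTF-16 or not
string comment = System.Convert.ToBase64String(bytes);       // safe to store in the Comment field
byte[] restored = System.Convert.FromBase64String(comment);  // byte-for-byte identical to bytes
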
Dave Van den Eynde
Point #4 is exactly why I'm looking for "a more standard technique than something I just cooked up on my own". Using just 6 bits with base64 seems like a waste when I KNOW the string is UTF-16 and has nearly 16 bits available ("nearly" because of bit sequences that aren't valid characters).
Dan
True, it may sound like a waste, but so does any other file saved as UTF-16, because most of us don't use the whole character space anyway. Besides, if that legacy app is smart it would save as UTF-8 to a database or to a file, in which case using all of Unicode would have adverse effects.
Dave Van den Eynde
Perhaps the UTF-16 string always remains in memory; thus, it's (somewhat) important to use space efficiently. Memory is still relatively expensive compared to disk, especially with the more limited address space on 32-bit machines.
Dan
+3  A: 

Firstly, remember that Unicode doesn't mean 16 bits. The fact that System.String uses UTF-16 internally is neither here nor there. Unicode characters are abstract - they only gain bit representations through encodings.

You say "my storage is a System.String" - if that's the case, you cannot talk about bits and bytes, only Unicode characters. System.String certainly has it's own internal encoding, but (in theory) that could be different.

Incidentally, if you believe that the internal representation of System.String is too memory-inefficient for Base64-encoded data, why aren't you also worrying about Latin/Western strings?

If you want to store binary data in a System.String, you need a mapping between collections of bits and characters.

Option A: There's a pre-made one in the shape of Base64 encoding. As you've pointed out, this encodes six bits of data per character.

Option B: If you want to pack more bits per character, then you'll need to make an array (or encoding) of 128, 256, 512, etc. Unicode characters, and pack 7, 8, 9, etc. bits of data per character. Those characters need to be real Unicode characters.
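
For example, a minimal sketch of the 8-bits-per-character case, using the Braille Patterns block (U+2800-U+28FF, a contiguous run of 256 assigned characters) as an arbitrary alphabet; PackBase256/UnpackBase256 are just illustrative names:

static string PackBase256(byte[] data)
{
    // One real BMP character per byte: still two bytes of UTF-16 storage
    // per byte of data, but every output character is valid.
    var chars = new char[data.Length];
    for (int i = 0; i < data.Length; i++)
        chars[i] = (char) (0x2800 + data[i]);
    return new string(chars);
}

static byte[] UnpackBase256(string packed)
{
    var data = new byte[packed.Length];
    for (int i = 0; i < packed.Length; i++)
        data[i] = (byte) (packed[i] - 0x2800);
    return data;
}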

To answer your question simply, yes there is a standard, it's Base64-encoding.

Is this a real problem? Do you have perf data to support your idea of not using Base64?

stusmith
It's memory usage. In the sample above, base64 uses 4 times as much memory as UTF-16: 16 bytes vs 4 bytes. The legacy application I'm working with already feels memory pressure on 32-bit systems, so I'd like to find an "easy" and "standard" way to be more efficient.
Dan
+4  A: 

I stumbled onto Base16k after reading your question. Not strictly a standard but it seems to work well and was easy enough to implement in C#.

Bingo! This looks to be almost EXACTLY what I was looking for, depending on your definition of "standard".
Dan