I'm currently developing an application in C# that uses Amazon SQS. The size limit for a message is 8kb.

I have a method that is something like:

public void QueueMessage(string message)

Within this method, I'd like to first compress the message (most messages are passed in as JSON, so they are already fairly small).

If the compressed string is still larger than 8kb, I'll store it in S3.

My question is:

How can I easily test the size of a string, and what's the best way to compress it? I'm not looking for massive reductions in size, just something nice and easy, and easy to decompress at the other end.

+7  A: 

To know the "size" (in kb) of a string, we need to know the encoding. Assuming UTF8 (and not counting a BOM etc.), it is as below; swap the encoding if it isn't UTF8:

int len = Encoding.UTF8.GetByteCount(longString);
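As a rough sketch of how that byte count could be turned into a limit check: the class name `MessageSize`, the method `FitsInQueue`, and the constant `MaxMessageBytes` are hypothetical, and the limit is simply the 8kb figure from the question, assumed here to mean 8 * 1024 bytes.

```csharp
using System.Text;

public static class MessageSize
{
    // Assumption: the 8kb limit quoted in the question, read as 8 * 1024 bytes.
    public const int MaxMessageBytes = 8 * 1024;

    // Returns true when the UTF-8 encoded form of the text fits within the limit.
    public static bool FitsInQueue(string text)
    {
        return Encoding.UTF8.GetByteCount(text) <= MaxMessageBytes;
    }
}
```

Note that the check counts bytes, not characters: `new string('é', 5000)` has only 5000 characters but encodes to 10000 bytes in UTF8, so it would not fit.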

Re packing it; I would suggest GZIP via UTF8, optionally followed by base-64 if it has to be a string:

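    // needs: using System; using System.IO; using System.IO.Compression; using System.Text;
    using (MemoryStream ms = new MemoryStream())
    {
        // leaveOpen = true, so ms stays usable after the GZipStream is closed
        using (GZipStream gzip = new GZipStream(ms, CompressionMode.Compress, true))
        {
            byte[] raw = Encoding.UTF8.GetBytes(longString);
            gzip.Write(raw, 0, raw.Length);
            gzip.Close(); // writes any buffered compressed data into ms
        }
        byte[] zipped = ms.ToArray(); // as a BLOB
        string base64 = Convert.ToBase64String(zipped); // as a string
        // store zipped or base64
    }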
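Since the question also asks for something easy to decompress at the other end, here is a minimal round-trip sketch, assuming the base-64 form is what gets sent. The class name `GzipText` and its two method names are hypothetical; the compress half follows the same steps as the code above, and the decompress half reverses them.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzipText
{
    // UTF-8 bytes -> GZIP -> base-64 text.
    public static string Compress(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (MemoryStream ms = new MemoryStream())
        {
            // leaveOpen = true keeps ms usable after the GZipStream is disposed.
            using (GZipStream gzip = new GZipStream(ms, CompressionMode.Compress, true))
            {
                gzip.Write(raw, 0, raw.Length);
            } // disposing gzip writes the remaining compressed bytes into ms
            return Convert.ToBase64String(ms.ToArray());
        }
    }

    // Reverse direction: base-64 text -> GZIP bytes -> UTF-8 string.
    public static string Decompress(string base64)
    {
        byte[] zipped = Convert.FromBase64String(base64);
        using (MemoryStream input = new MemoryStream(zipped))
        using (GZipStream gzip = new GZipStream(input, CompressionMode.Decompress))
        using (MemoryStream output = new MemoryStream())
        {
            gzip.CopyTo(output); // Stream.CopyTo needs .NET 4.0 or later
            return Encoding.UTF8.GetString(output.ToArray());
        }
    }
}
```

If the base-64 string is what goes on the queue, its length is the size to compare against the limit: base-64 uses only ASCII characters, and it is roughly a third larger than the zipped bytes.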
Marc Gravell
Thanks. How do I determine the encoding? I haven't set this anywhere... I just serialize an object to JSON (using the Json.NET lib)
alex
Question: is the `gzip.Close()` call necessary, considering exiting the `using` block should close it anyway?
tzaman
@alex: You'd choose the encoding yourself when serializing the string to binary. As Marc says, UTF-8 is the best choice for size, since most characters occupy only one byte in this encoding.
Will Vousden
@tzaman - to be honest, not sure; but I *do* know that `GZipStream` keeps a buffer even if you `Flush()`, so it must be closed. The `using` may indeed suffice, so maybe I'm being explicit unnecessarily.
Marc Gravell
@Will - well, *generally* it is; there are some i18n occasions where UTF8 will be more expensive. But it is a reasonable default.
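A small sketch to illustrate the point, comparing byte counts under UTF8 and UTF-16 (`Encoding.Unicode` in .NET). The class `EncodingCost` and its method names are hypothetical helpers for the comparison only.

```csharp
using System.Text;

public static class EncodingCost
{
    // Byte count of the string when encoded as UTF-8 (no BOM counted).
    public static int Utf8Bytes(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }

    // Byte count of the string when encoded as UTF-16 little-endian (no BOM counted).
    public static int Utf16Bytes(string s)
    {
        return Encoding.Unicode.GetByteCount(s);
    }
}
```

For ASCII text such as `"hello"`, UTF8 takes 5 bytes against 10 for UTF-16. For the three-character Japanese string `"\u65E5\u672C\u8A9E"`, UTF8 takes 9 bytes against 6 for UTF-16, which is the kind of case where UTF8 costs more.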
Marc Gravell
@alex - an encoding is the map between character data and bytes; this *might* be listed in the SQS/S3 documentation?
Marc Gravell