I am compressing data using the GZipStream class in a C# WinForms app.

Data is provided as a series of strings separated by newline characters, like this:

FFFFFFFF
FFFFFEFF
FDFFFFFF
00FFFFFF

Before compressing, I convert the string to a byte array, but doing so fails if newline characters are present.

Each newline is significant, but I am not sure how to preserve their position in the encoding.

Here is the code I am using to convert to a byte array:

    private static byte[] HexStringToByteArray(string _hex)
    {
        _hex = _hex.Replace("\r\n", "");
        if (_hex.Length % 2 != 0) throw new FormatException("Hex string length must be divisible by 2.");
        int l = _hex.Length / 2;
        byte[] b = new byte[l];
        for (int i = 0; i < l; i++)
            b[i] = Convert.ToByte(_hex.Substring(i * 2, 2), 16);
        return b;
    }

If the newlines are not removed, Convert.ToByte throws a FormatException with the message "Additional non-parsable characters are at the end of the string." That doesn't surprise me.

What would be the best way to make sure newline characters can be included properly?

Note: the compressed version of this string must itself be a string that can be included in an XML document.

Edit:

I have tried to simply convert the string to a byte array without performing any binary conversion on it, but am still having trouble with the compression. Here are the relevant methods:

    private static byte[] StringToByteArray(string _s)
    {
        Encoding enc = Encoding.ASCII;
        return enc.GetBytes(_s);
    }

    public static byte[] Compress(byte[] buffer)
    {
        MemoryStream ms = new MemoryStream();
        GZipStream zip = new GZipStream(ms, CompressionMode.Compress, true);
        zip.Write(buffer, 0, buffer.Length);
        zip.Close();
        ms.Position = 0;

        byte[] compressed = new byte[ms.Length];
        ms.Read(compressed, 0, compressed.Length);

        byte[] gzBuffer = new byte[compressed.Length + 4];
        Buffer.BlockCopy(compressed, 0, gzBuffer, 4, compressed.Length);
        Buffer.BlockCopy(BitConverter.GetBytes(buffer.Length), 0, gzBuffer, 0, 4);
        return gzBuffer;
    }

A: 

Firstly: are you certain that just compressing the text doesn't give much the same result as compressing the "converted to binary" form?

Assuming you want to go ahead with converting to binary, I can suggest two options:

  • At the start of each line, write a number stating how many bytes are in the line. Then when you decompress, you read and convert that many bytes, then write a newline. If you know that each line is always going to be less than 256 bytes long, you can just represent this as a single byte. Otherwise you might want a larger fixed size, or some variable size encoding (e.g. "while the top bit is set, this is still part of the number") - the latter gets hairy pretty quickly.
  • Alternatively, "escape" a newline by representing it as (say) 0xFF, 0x00. You'd then also need to escape a genuine 0xFF as (say) 0xFF 0xFF. When you read the data, if you read an 0xFF you'd then read the next byte to determine whether it represented a newline or a genuine 0xFF.
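
The first option above could be sketched roughly as follows. This is only an illustration, not code from the answer: the class name `LinePrefixedHex` and its `Encode`/`Decode` methods are hypothetical, and it assumes every line holds an even number of hex digits and fewer than 256 bytes. Decoding normalizes line endings to `\r\n` and hex digits to upper case.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper (not from the original post): packs newline-separated
// hex lines into bytes, with a one-byte length prefix before each line.
public static class LinePrefixedHex
{
    public static byte[] Encode(string hexLines)
    {
        // Treat "\r\n" and "\n" alike, then split into individual lines.
        string[] lines = hexLines.Replace("\r\n", "\n").Split('\n');
        using (var ms = new MemoryStream())
        {
            foreach (string line in lines)
            {
                if (line.Length % 2 != 0)
                    throw new FormatException("Each hex line must have an even length.");
                int byteCount = line.Length / 2;
                if (byteCount > 255)
                    throw new FormatException("Line too long for a one-byte length prefix.");

                // Length prefix first, then the line's bytes.
                ms.WriteByte((byte)byteCount);
                for (int i = 0; i < byteCount; i++)
                    ms.WriteByte(Convert.ToByte(line.Substring(i * 2, 2), 16));
            }
            return ms.ToArray();
        }
    }

    public static string Decode(byte[] data)
    {
        var sb = new StringBuilder();
        int pos = 0;
        bool first = true;
        while (pos < data.Length)
        {
            int byteCount = data[pos++];
            if (pos + byteCount > data.Length)
                throw new FormatException("Truncated line data.");

            // Re-create the newline between lines (assumed to be "\r\n").
            if (!first) sb.Append("\r\n");
            first = false;
            for (int i = 0; i < byteCount; i++)
                sb.Append(data[pos++].ToString("X2"));
        }
        return sb.ToString();
    }
}
```

The resulting byte array can then be passed to the compression step; an empty line is stored as a single zero length byte, so blank lines and a trailing newline survive the round trip.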

EDIT: I believe your original approach was fundamentally flawed. Whatever you get out of GZipStream is not text, and shouldn't be treated as if it were text using Encoding. However, you can turn it into ASCII text very easily, by calling Convert.ToBase64String. By the way, another trick you've missed is to call ToArray on the MemoryStream, which will give you the contents as a byte[] with no extra messing around.
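
A minimal sketch of that suggestion, under some assumptions: the class name `GzipText` and the method `CompressToBase64` are made up for illustration, UTF-8 is assumed as the text encoding (the question uses ASCII), and the 4-byte length prefix from the question's `Compress` method is left out. The Base64 alphabet (letters, digits, `+`, `/`, `=`) is safe to place in XML text.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper (not from the original post).
public static class GzipText
{
    // Compresses the text, newlines included, and returns it as Base64.
    public static string CompressToBase64(string text)
    {
        // Assumption: UTF-8; any encoding works if decompression uses the same one.
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (var ms = new MemoryStream())
        {
            // leaveOpen: true keeps the MemoryStream usable after the
            // GZipStream is disposed, which flushes the compressed data.
            using (var zip = new GZipStream(ms, CompressionMode.Compress, true))
            {
                zip.Write(raw, 0, raw.Length);
            }
            // ToArray replaces the manual Position/Read/copy steps.
            return Convert.ToBase64String(ms.ToArray());
        }
    }
}
```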

Jon Skeet
I think I've added complexity by trying to convert to binary instead of just converting to a byte array. However, I have not had luck with the code I will append to my question.
JYelton
I've updated the question - I think this worked with the binary conversion because it somehow ensured all compressed bytes were ASCII-printable characters. If I simply convert the string to a byte array and then compress it, the resulting bytes fall outside the printable range, which is why I am unable to decode it.
JYelton
@JYelton: In that case, there's a much simpler answer. Editing...
Jon Skeet
@Jon: Getting there! Your suggestions are working, however there remains the last problem of converting the compressed base64 string back to its original form. Also I am not sure where I can eliminate some code by using `MemoryStream.ToArray()`
JYelton
@JYelton: The base64 string isn't compressed... it's a base64 string *of the compressed data*. As for reversing it: use `Convert.FromBase64String` to get back to the compressed binary data. Uncompress that and you've got the original data. And `MemoryStream.ToArray` simplifies this code: `byte[] compressed = new byte[ms.Length]; ms.Read(compressed, 0, compressed.Length);`
Jon Skeet
@Jon: The new implementation is complete and works great. It involved creating four small helper methods that convert to and from byte arrays and strings; and byte arrays and Base64 encoded strings. Thanks for the guidance as always.
JYelton
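
The reverse direction described in the comments above could look something like the sketch below. It is not the asker's final implementation, which was not posted: `GzipTextReader` and `DecompressFromBase64` are hypothetical names, UTF-8 is assumed to match whatever encoding was used when compressing, and `Stream.CopyTo` assumes .NET 4 or later.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper (not from the original post).
public static class GzipTextReader
{
    // Turns a Base64 string of gzip data back into the original text.
    public static string DecompressFromBase64(string base64)
    {
        // Base64 text -> compressed binary data.
        byte[] compressed = Convert.FromBase64String(base64);
        using (var input = new MemoryStream(compressed))
        using (var zip = new GZipStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            // Decompress everything into the output stream.
            zip.CopyTo(output);
            // Assumption: the text was encoded as UTF-8 before compression.
            return Encoding.UTF8.GetString(output.ToArray());
        }
    }
}
```

Because the newlines are compressed along with the rest of the text, they come back in their original positions without any special handling.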
A: 

If the data you posted is representative of all the data, then you have a newline every 4 bytes, so if you need it when converting back, just insert one after every 4 bytes of data.

BioBuckyBall
Unfortunately, it is greatly simplified; most lines will be about 80 bytes long, or 40 2-character hex strings. It is variable, though.
JYelton
@JYelton too bad, variable length obviously makes it harder :(
BioBuckyBall
It would be cake if that were the case! I would just continue stripping newlines and re-create them on the other side if the lengths were static.
JYelton