views:

103

answers:

5

For a genetic algorithm application, I'm using a whole load of binary strings. Most of the time they literally take the form of 01001010110, so that they can be mated, mutated and "crossed-over".

For transport and storage however, this seems wasteful. What's the simplest way to encode this as a shorter string?

I'm guessing this is pretty trivial, but I'm not sure where to start looking.

Update: I actually need to end up with another string: one of the transport requests will be GET requests.

+1  A: 

I would just store them as an array of bytes and use a helper function to translate between the byte array version and the string version.

Scott Chamberlain
If the string itself is in ASCII form and there's no packing going on to fit 8 bits/byte (and instead just storing 1 bit per byte) this won't really save much space.
Joshua Rodgers
I meant store it 8 string intagers to one byte.
Scott Chamberlain
Would I be able to get a string representation of the byte array?
Tom Wright
Yes, any time you can convert back to a string, if you really need it.
slomojo
+2  A: 

What about converting it to it's base 10 integer equivalent?

int myBin = Convert.ToInt32("01001010110", 2);

Convert.ToInt32() documentation

Abe Miessler
You run the risk of issues of endienness and sign issues if you are transmitting between machines of different architectures. It would be safer to use bytes.
Scott Chamberlain
This isn't true, a string doesn't have endianness. Most significant bit is always on the left.
Hans Passant
+6  A: 

The simplest would be to take each digit and treat it as a bit. Each group of 8 bits can be stored in a byte. Then you can send it as a stream of bytes. You will also need to store the length of the original string so that you can distinguish between "0" and "00".

Here is one way you could write the conversion from string to a byte array:

byte[] convertToBytes(string s)
{
    byte[] result = new byte[(s.Length + 7) / 8];

    int i = 0;
    int j = 0;
    foreach (char c in s)
    {
        result[i] <<= 1;
        if (c == '1')
            result[i] |= 1;
        j++;
        if (j == 8)
        {
            i++;
            j = 0;
        }
    }
    return result;
}

Reversing the operation is very similar.

If you need to transmit the data as a string you can base 64 encode the resulting byte array.

You may also want to consider keeping it in this form in memory too. This will be much more efficient than storing it as a string where each digit is stored as a 2 byte character. You are using roughly 16 times more memory than you need to for storing your data. The disadvtange is that it is slightly more difficult to use in this form, so if you have enough memory then what you are currently doing might be just fine.

Mark Byers
A: 

Abe Miessler's answer is a good one, but with caveat mentioned in comments.

If 64 bits is not enough to represent your string, then consider using a BigInt class http://www.codeproject.com/KB/cs/BigInt.aspx (you would probably want to add to/fromBinary() extension methods to it. Alternatively represent this as a ... linked list of bytes.

Either approach has the problem of discarding any leading zeroes, so you would want to store the original length as well.

Hamish Grubijan
+1  A: 

Or implement Run length encoding or Huffman coding. Both are fairly easy to implement. RLE is by far the easiest, but will in most cases have worse compression ratio. If your data typically has many consecutive characters of the same value, it could still provide a substantial improvement.

Pedery