views:

473

answers:

2

Hi all

I was recently asked to complete a task for a c++ role, however as the application was decided not to be progressed any further I thought that I would post here for some feedback / advice / improvements / reminder of concepts I've forgotten.

The task was:

The following data is a time series of integer values

int timeseries[32] = {67497, 67376, 67173, 67235, 67057, 67031, 66951,
66974, 67042, 67025, 66897, 67077, 67082, 67033, 67019, 67149, 67044,
67012, 67220, 67239, 66893, 66984, 66866, 66693, 66770, 66722, 66620,
66579, 66596, 66713, 66852, 66715};

The series might be, for example, the closing price of a stock each day over a 32 day period.

As stored above, the data will occupy 32 x sizeof(int) bytes = 128 bytes assuming 4 byte ints.

Using delta encoding , write a function to compress, and a function to uncompress data like the above.

Ok, so before this point I had never looked at compression so my solution is far from perfect. The manner in which I approached the problem is by compressing the array of integers into a array of bytes. When representing the integer as a byte I keep the calculate most significant byte (msb) and keep everything up to this point, whilst throwing the rest away. This is then added to the byte array. For negative values I increment the msb by 1 so that we can differentiate between positive and negative bytes when decoding by keeping the leading 1 bit values.

When decoding I parse this jagged byte array and simply reverse my previous actions performed when compressing. As mentioned I have never looked at compression prior to this task so I did come up with my own method to compress the data. I was looking at C++/Cli recently, had not really used it previously so just decided to write it in this language, no particular reason. Below is the class, and a unit test at the very bottom. Any advice / improvements / enhancements will be much appreciated.

Thanks.

array<array<Byte>^>^ CDeltaEncoding::CompressArray(array<int>^ data)
{
    int temp = 0;
    int original;
    int size = 0;

    array<int>^ tempData = gcnew array<int>(data->Length);
    data->CopyTo(tempData, 0);

    array<array<Byte>^>^ byteArray = gcnew array<array<Byte>^>(tempData->Length);

    for (int i = 0; i < tempData->Length; ++i)
    {
            original = tempData[i];
            tempData[i] -= temp;
            temp = original;

            int msb = GetMostSignificantByte(tempData[i]);

            byteArray[i] = gcnew array<Byte>(msb);
            System::Buffer::BlockCopy(BitConverter::GetBytes(tempData[i]), 0, byteArray[i], 0, msb );

            size += byteArray[i]->Length;
    }

    return byteArray;
}

array<int>^ CDeltaEncoding::DecompressArray(array<array<Byte>^>^ buffer)
{
    System::Collections::Generic::List<int>^ decodedArray = gcnew System::Collections::Generic::List<int>();

    int temp = 0;

    for (int i = 0; i < buffer->Length; ++i) 
    {
            int retrievedVal = GetValueAsInteger(buffer[i]);
            decodedArray->Add(retrievedVal);

            decodedArray[i] += temp;
            temp = decodedArray[i];
    }

    return decodedArray->ToArray();
}

int CDeltaEncoding::GetMostSignificantByte(int value)
{
    array<Byte>^ tempBuf = BitConverter::GetBytes(Math::Abs(value));

    int msb = tempBuf->Length;

    for (int i = tempBuf->Length -1; i >= 0; --i)
    {
            if (tempBuf[i] != 0)
            {
                    msb = i + 1;
                    break;
            }
    }

    if (!IsPositiveInteger(value))
    {
            //We need an extra byte to differentiate the negative integers
            msb++;
    }

    return msb;
}

bool CDeltaEncoding::IsPositiveInteger(int value)
{
    return value / Math::Abs(value) == 1;
}

int CDeltaEncoding::GetValueAsInteger(array<Byte>^ buffer)
{
    array<Byte>^ tempBuf;

    if(buffer->Length % 2 == 0)
    {
            //With even integers there is no need to allocate a new byte array
            tempBuf = buffer;
    }
    else
    {
            tempBuf = gcnew array<Byte>(4);
            System::Buffer::BlockCopy(buffer, 0, tempBuf, 0, buffer->Length );

            unsigned int val = buffer[buffer->Length-1] &= 0xFF;

            if ( val == 0xFF )
            {
                    //We have negative integer compressed into 3 bytes
                    //Copy over the this last byte as well so we keep the negative pattern
                    System::Buffer::BlockCopy(buffer, buffer->Length-1, tempBuf, buffer->Length, 1 );
            }
    }

    switch(tempBuf->Length)
    {
    case sizeof(short):
            return BitConverter::ToInt16(tempBuf,0);
    case sizeof(int):
    default:
            return BitConverter::ToInt32(tempBuf,0);
    }
}

And then in a test class I had:

void CTestDeltaEncoding::TestCompression()
{
    array<array<Byte>^>^ byteArray = CDeltaEncoding::CompressArray(m_testdata);

    array<int>^ decompressedArray = CDeltaEncoding::DecompressArray(byteArray);

    int totalBytes = 0;
    for (int i = 0; i<byteArray->Length; i++)
    {
            totalBytes += byteArray[i]->Length;
    }

    Assert::IsTrue(m_testdata->Length * sizeof(m_testdata) > totalBytes, "Expected the total bytes to be less than the original array!!");

    //Expected totalBytes = 53
}
+3  A: 

This smells a lot like homework to me. The crucial phrase is: "Using delta encoding."

Delta encoding means you encode the delta (difference) between each number and the next:

67497, 67376, 67173, 67235, 67057, 67031, 66951, 66974, 67042, 67025, 66897, 67077, 67082, 67033, 67019, 67149, 67044, 67012, 67220, 67239, 66893, 66984, 66866, 66693, 66770, 66722, 66620, 66579, 66596, 66713, 66852, 66715

would turn into:

[Base: 67497]: -121, -203, +62

and so on. Assuming 8-bit bytes, the original numbers require 3 bytes apiece (and given the number of compilers with 3-byte integer types, you're normally going to end up with 4 bytes apiece). From the looks of things, the differences will fit quite easily in 2 bytes apiece, and if you can ignore one (or possibly two) of the least significant bits, you can fit them in one byte apiece.

Delta encoding is most often used for things like sound encoding where you can "fudge" the accuracy at times without major problems. For example, if you have a change from one sample to the next that's larger than you've left space to encode, you can encode a maximum change in the current difference, and add the difference to the next delta (and if you don't mind some back-tracking, you can distribute some to the previous delta as well). This will act as a low-pass filter, limiting the gradient between samples.

For example, in the series you gave, a simple delta encoding requires ten bits to represent all the differences. By dropping the LSB, however, nearly all the samples (all but one, in fact) can be encoded in 8 bits. That one has a difference (right shifted one bit) of -173, so if we represent it as -128, we have 45 left. We can distribute that error evenly between the preceding and following sample. In that case, the output won't be an exact match for the input, but if we're talking about something like sound, the difference probably won't be particularly obvious.

Jerry Coffin
+1  A: 

I did mention that it was an exercise that I had to complete and the solution that I received was deemed not good enough, so I wanted some constructive feedback seeing as actual companies never decide to tell you what you did wrong.

When the array is compressed I store the differences and not the original values except the first as this was my understanding. If you had looked at my code I have provided a full solution but my question was how bad was it?

Zahir