tags:

views:

515

answers:

8

I'm trying to take 21 bytes of data which uniquely identifies a trade and store it in a 16 byte char array. I'm having trouble coming up with the right algorithm for this.

The trade ID which I'm trying to compress consists of 2 fields:

  1. 18 alphanumeric characters consisting of the ASCII characters 0x20 to 0x7E, Inclusive. (32-126)
  2. A 3-character numeric string "000" to "999"

So a C++ class that would encompass this data looks like this:

class ID
{
public:
    char trade_num_[18];
    char broker_[3];
};

This data needs to be stored in a 16-char data structure, which looks like this:

class Compressed
{
public:
    char sku_[16];    
};

I tried to take advantage of the fact that since the characters in trade_num_ are only 0-127 there was 1 unused bit in each character. Similarly, 999 in binary is 1111100111, which is only 10 bits -- 6 bits short of a 2-byte word. But when I work out how much I can squeeze this down, the smallest I can make it is 17 bytes; one byte too big.

Any ideas?

By the way, trade_num_ is a misnomer. It can contain letters and other characters. That's what the spec says.

EDIT: Sorry for the confusion. The trade_num_ field is indeed 18 bytes and not 16. After I posted this thread my internet connection died and I could not get back to this thread until just now.

EDIT2: I think it is safe to make an assumption about the dataset. For the trade_num_ field, we can assume that the non-printable ASCII characters 0-31 will not be present. Nor will ASCII codes 127 or 126 (~). All the others might be present, including upper and lower case letters, numbers and punctuations. This leaves a total of 94 characters in the set that trade_num_ will be comprised of, ASCII codes 32 through 125, inclusive.

+28  A: 

If you have 18 characters in the range 0 - 127 and a number in the range 0 - 999 and compact this as much as possible then it will require 17 bytes.

>>> math.log(128**18 * 1000, 256)
16.995723035582763

You may be able to take advantage of the fact that some characters are most likely not used. In particular it is unlikely that there are any characters below value 32, and 127 is also probably not used. If you can find one more unused character so you can first convert the characters into base 94 and then pack them into the bytes as closely as possible.

>>> math.log(94**18 * 1000, 256)
15.993547951857446

This just fits into 16 bytes!


Example code

Here is some example code written in Python (but written in a very imperative style so that it can easily be understood by non-Python programmers). I'm assuming that there are no tildes (~) in the input. If there are you should substitute them with another character before encoding the string.

def encodeChar(c):
    return ord(c) - 32

def encode(s, n):
    t = 0
    for c in s:
        t = t * 94 + encodeChar(c)
    t = t * 1000 + n

    r = []
    for i in range(16):
        r.append(int(t % 256))
        t /= 256

    return r

print encode('                  ', 0)    # smallest possible value
print encode('abcdefghijklmnopqr', 123)
print encode('}}}}}}}}}}}}}}}}}}', 999)  # largest possible value

Output:

[  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0]
[ 59, 118, 192, 166, 108,  50, 131, 135, 174,  93,  87, 215, 177,  56, 170, 172]
[255, 255, 159, 243, 182, 100,  36, 102, 214, 109, 171,  77, 211, 183,   0, 247]

This algorithm uses Python's ability to handle very large numbers. To convert this code to C++ you could use a big integer library.

You will of course need an equivalent decoding function, the principle is the same - the operations are performed in reverse order.

Mark Byers
There's probably a decent chance that the space character (or at least one of the punctuation characters) might be found to be unused in the input set.
Michael Burr
@Mark, @Michael: Space and punctuation characters are frequently used in this kind of data. Moreover the spec says quite clearly that any ASCII character from 0x00 to 0x7F can be used in the trade number field, but in my experience the non-printable characters are not used. Trade IDs are generally human-readable, and my examination of this data feed supports the same to be true for this feed as well. So I think this solution will work. I'll post my solution when I have the code written.
John Dibling
@Mark: If I take out 127 and 126 (the tilde) this yields 94 characters in the input set. But 93 in binary is 1011101, which is still 7 bits. Am I missing something here?
John Dibling
John: What you're missing is base-94. Consider: a number from 0-199, base 10. The first digit has two options, so needs one bit. The next digits have 10 options each, so four bits. Total of 9 bits. Yet you know we can fit numbers from 0-199 in a single octet. The key is that *there are not a certain number of bits reserved for each digit, rather the entire sequence is converted to a large number which is then encoded in binary*.
Ben Voigt
@John Dibling: yes, you really have to squeeze your data to make it fit; just chopping the eight bit wont suffice.
liori
@Ben Voigt: Yes that is exactly what I had in mind. Though I'm not sure if C++ has good support for large integers so it might require using an external library, or else a clever algorithm to avoid ever having to store such large numbers.
Mark Byers
@Ben, @Mark: OK, I *think* I understand what you're saying, and if so I came to a similar conclusion independently. The problem I then ran in to was the encoding part. Take a much-simplified example. 3 characters, which can be A B or C. A has a value of 1, B=2, C=3. How do I encode this string? I can't just add the values together because they won't be unique, and if I concatenate the binary representation (eg 01, 10, 11), the aggregate will take up no less space than the individual fields. I suspect I'm just not getting something here.
John Dibling
@John Dibling: I've provided example source code in Python which I hope you can understand to get you started. If you want to convert this algorithm directly to C++ you would need a a big integer library for C++, and for that you might want to look at this thread: http://stackoverflow.com/questions/124332/c-handling-very-large-integers.
Mark Byers
@Mark: Thanks, I'll take a look
John Dibling
@John Dibling: the only problem left is how sure you are that there will be no characters 0x00-0x1F and 0x7E-0x7F. If the standard says they can be there, there is a chance someone wanted to use that... then your code will be broken. You should at least have a check whether the characters are really in your allowed character set.
liori
What happened to `~` (0x7e)?
KennyTM
@KennyTM: If you include tilde then it doesn't fit in 16 bytes unfortunately :( ... see my first paragraph for an explanation of why.
Mark Byers
@liori, @KennyTM, et al: It became apparent to me that inn order to make this work at all, I have to make certain assumptions about the data beyond what the spec says. Although the spec says "any character from 0x00 to 0x7F" in my experience these are human-readable strings. Also, the tilde character is not seen in practice. I will of course add code to verify that my assumptions are true at run-time (and if not handle it gracefully), but for now I am going on the assumption that the set of valid characters will be 0x20-0x7D inclusive.
John Dibling
@Mark re: edit. Ah, I was wondering about that 84. For a few minutes there I thought I was just incredibly stupid. :)
John Dibling
@Mark: Re: `for i in range(16):` does this execute 16 or 17 times?
John Dibling
@John Dibling: 16 times.
Mark Byers
@John Dibling: Have you found a big number library? Would you also like me to post a corresponding decoding function written in Python so that you can translate it to C++?
Mark Byers
John Dibling
I'm accepting this answer and awarding the bounty to it since the question in combination with the comments and Mark's patience in answering my questions was very helpful.
John Dibling
+5  A: 

That makes (18*7+10)=136 bits, or 17 bytes. You wrote trade_num is alphanumeric? If that means the usual [a-zA-Z0-9_] set of characters, then you'd have only 6 bits per character, needing (18*6+10)=118 bit = 15 bytes for the whole thing.

Assuming 8 bit = 1 byte

Or, coming from another direction: You have 128 bits for storage, you need ~10 bits for the number part, so there are 118 left for the trade_num. 18 characters means 118/18=6.555 bits per characters, this means you can have only the space to encode 2**6.555 = 94 different characters unless there is a hidden structure in trade_num that we could exploit to save more bits.

Luther Blissett
@Luther, As I said in my OP, the spec defines 'alphanumeric' as being any of the ASCII characters from 0x00 to 0x7F. This does not correlate exactly to what a C++ programmer naturally considers to be 'alphanumeric'
John Dibling
@Luther: Are you saying it can't be done?
John Dibling
If the values on trade_num are independent and evenly distributed, then yes.
Luther Blissett
A: 

If it can only contain letters, then you have less than 64 possibilities per character (26 upper case, 26 lower case, leaving you 12 for space, terminator, underscore, etc). With 6 bits per character, you should get there - in 15 characters. Assuming you don't support special characters.

EboMike
@EboMike: It can contain more than letters. Please see my revisions.
John Dibling
+1  A: 

You can do this in ~~15bytes (14 bytes and 6 bits).

For each character from trace_num_ you can save 1 bit if you want save ascii in 7 bits.

  • Then you have 2 bytes free and 2 bits, you must have 5.

Let get number information, each char can be one from ten values (0 to 9). Then you must have 4 bits to save this character, to save number you must have 1 byte and 4 bits, then you save half of this.

  • Now you have 3 bytes free and 6 bits, you must have 5.

If you want to use only qwertyuioplkjhgfdsazxcvbnmQWERTYUIOPLKJHGFDSAZXCVBNM1234567890[] You can save each char in 6 bits. Then you have next 2 bytes and 2 bits.

  • Now you have 6 bytes left, and your string can save in 15 bytes + nulltermination = 16bytes.

And if you save your number in integer on 10 bytes. You can fit this into 14 bytes and 6 bits.

Svisstack
@Svisstack: It can be more than the set of characters you suggest. My OP was quite clear on this before my edits, but I have made it even more clear now.
John Dibling
+1  A: 

Key questions are:

There appears to be some contradiction in your post whether the trade number is 16 or 18 characters. You need to clear that up. You say the total is 21 consisting of 16+3. :-(

You say the trade num characters are in the range 0x00-0x7f. Can they really be any character in that range, including tab, new line, control-C, etc? Or are they limited to printable characters, or maybe even to alphanumerics?

Does the output 16 bytes have to be printable characters, or is it basically a binary number?

EDIT, after updates to original post:

In that case, if the output can be any character in the character set, it's possible. If it can only be printable characters, it's not.

Demonstration of the mathematical possibility is straightforward enough. There are 94 possible values for each of 18 characters, and 10 possible values for each of 3. Total number of possible combinations = 94 ^ 18 * 10 ^ 3 ~= 3.28E35. This requires 128 bits. 2 ^127 ~= 1.70e38, which is too small, while 2^128 ~= 3.40e38, which is big enough. 128 bits is 16 bytes, so it will just barely fit if we can use every possible bit combination.

Given the tight fit, I think the most practical way to generate the value is to think of it as a double-long number, and then run the input through an algorithm to generate a unique integer for every possible input.

Conceptually, then, let's imagine we had a "huge integer" data type that is 16 bytes long. The algorithm would be something like this:

huge out;
for (int p=0;p<18;++p)
{
  out=out*94+tradenum[p]-32;
}
for (int p=0;p<3;++p)
{
  out=out*10+broker[p]-'0';
}

// Convert output to char[16]
unsigned char[16] out16;
for (int p=15;p>=0;--p)
{
  out16[p]=huge&0xff;
  huge=huge>>8;
}

return out16;

Of course we don't have a "huge" data type in C. Are you using pure C or C++? Isn't there some kind of big number class in C++? Sorry, I haven't done C++ in a while. If not, we could easily create a little library to implement a huge.

Jay
@Jay: OP edited
John Dibling
@Jay: OP edited again. Let us say the character set is limited to 0x20 - 0x7E Inclusive.
John Dibling
@John: See update
Jay
A: 

Use the first 10 bits for the 3-character numeric string (encode the bits like they represent a number and then pad with zeros as appropriate when decoding).

Okay, this leaves you with 118 bits and 16 alphanumeric characters to store.

0x00 to 0x7F (if you mean inclusive) comprises 128 possible characters to represent. That means that each character can be identified by a combination of 7 bits. Come up with an index mapping each number those 7 bits can represent to the actual character. To represent 16 of your "alphanumeric" characters in this way, you need a total of 112 bits.

We now have 122 bits (or 15.25 bytes) representing our data. Add an easter egg to fill in the remaining unused bits and you have your 16 character array.

Octoberdan
@Octoberdan: OP edited. The trade number field is 18 bytes, not 16. Sorry for the typo.
John Dibling
@Octoberdan: Please elaborate on the "index mapping" you refer to.
John Dibling
+2  A: 

This is something that should work, assuming you need only characters from allowedchars, and there is at most 94 characters there. This is python, but it is written trying not to use fancy shortcuts--so that you'll be able to translate it to your destination language easier. It assumes however that the number variable may contain integers up to 2**128--in C++ you should use some kind of big number class.

allowedchars=' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}'
alphabase = len(allowedchars)

def compress(code):
    alphanumeric = code[0:18]
    number = int(code[18:21])

    for character in alphanumeric:
        # find returns index of character on the allowedchars list
        number = alphabase*number + allowedchars.find(character)

    compressed = ''
    for i in xrange(16):
        compressed += chr(number % 256)
        number = number/256

    return compressed

def decompress(compressed):
    number = 0

    for byte in reversed(compressed):
        number = 256*number + ord(byte)

    alphanumeric = ''
    for i in xrange(18):
        alphanumeric = allowedchars[number % alphabase] + alphanumeric
        number = number/alphabase

    # make a string padded with zeros
    number = '%03d' % number

    return alphanumeric + number
liori
+1  A: 
KennyTM