views:

445

answers:

6

I have an array of integers, lets assume they are of type int64_t. Now, I know that only every first n bits of every integer are meaningful (that is, I know that they are limited by some bounds).

What is the most efficient way to convert the array in the way that all unnecessary space is removed (i.e. I have the first integer at a[0], the second one at a[0] + n bits and so on) ?

I would like it to be general as much as possible, because n would vary from time to time, though I guess there might be smart optimizations for specific n like powers of 2 or sth.

Of course I know that I can just iterate value over value, I just want to ask you StackOverflowers if you can think of some more clever way.

Edit:

This question is not about compressing the array to take as least space as possible. I just need to "cut" n bits from every integer and given the array I know the exact n of bits I can safely cut.

+2  A: 

I know this might seem like the obvious thing to say as I'm sure there's actually a solution, but why not use a smaller type, like uint8_t (max 255)? or uint16_t (max 65535)?. I'm sure you could bit-manipulate on an int64_t using defined values and or operations and the like, but, aside from an academic exercise, why?

And on the note of academic exercises, Bit Twiddling Hacks is a good read.

Ninefingers
+1 for cool link. Well, this can sometimes be int64_t with, say, 49 bits useful. So using smaller type in not an option.
pajton
+5  A: 

Most any compression algorithm will get close to the minimum entropy needed to encode the integers, for example, Huffman coding, but accessing it like an array will be non-trivial.

keraba
Interesting idea, +1.
Ninefingers
The point is I'd like to write it later to a file, so I need to bitpack it first to save disk space.
pajton
If you want to minimize disk usage, you should look for a compression library instead of rolling your own.
Georg Fritzsche
Well, I am actually sort of rolling my own, hence the question:).
pajton
+3  A: 

I agree with keraba that you need to use something like Huffman coding or perhaps the Lempel-Ziv-Welch algorithm. The problem with bit-packing the way you are talking about is that you have two options:

  • Pick a constant n such that the largest integer can be represented.
  • Allow n to vary from value to value.

The first option is relatively easy to implement, but is really going to waste a lot of space unless all integers are rather small.

The second option has the major disadvantage that you have to convey changes in n somehow in the output bitstream. For instance, each value will have to have a length associated with it. This means you are storing two integers (albeit smaller integers) for every input value. There's a good chance you'll increase the file size with this method.

The advantage of Huffman or LZW is that they create codebooks in such a way that the length of the codes can be derived from the output bitstream without actually storing the lengths. These techniques allow you to get very close to the Shannon limit.

I decided to give your original idea (constant n, remove unused bits and pack) a try for fun and here is the naive implementation I came up with:

#include <sys/types.h>
#include <stdio.h>

int pack(int64_t* input, int nin, void* output, int n)
{
    int64_t inmask = 0;
    unsigned char* pout = (unsigned char*)output;
    int obit = 0;
    int nout = 0;
    *pout = 0;

    for(int i=0; i<nin; i++)
    {
        inmask = (int64_t)1 << (n-1);
        for(int k=0; k<n; k++)
        {
            if(obit>7)
            {
                obit = 0;
                pout++;
                *pout = 0;
            }
            *pout |= (((input[i] & inmask) >> (n-k-1)) << (7-obit));
            inmask >>= 1;
            obit++;
            nout++;
        }
    }
    return nout;
}

int unpack(void* input, int nbitsin, int64_t* output, int n)
{
    unsigned char* pin = (unsigned char*)input;
    int64_t* pout = output;
    int nbits = nbitsin;
    unsigned char inmask = 0x80;
    int inbit = 0;
    int nout = 0;
    while(nbits > 0)
    {
        *pout = 0;
        for(int i=0; i<n; i++)
        {
            if(inbit > 7)
            {
                pin++;
                inbit = 0;
            }
            *pout |= ((int64_t)((*pin & (inmask >> inbit)) >> (7-inbit))) << (n-i-1);
            inbit++;
        }
        pout++;
        nbits -= n;
        nout++;
    }
    return nout;
}

int main()
{
    int64_t input[] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20};
    int64_t output[21];
    unsigned char compressed[21*8];
    int n = 5;

    int nbits = pack(input, 21, compressed, n);
    int nout = unpack(compressed, nbits, output, n);

    for(int i=0; i<=20; i++)
        printf("input: %lld   output: %lld\n", input[i], output[i]);
}

This is very inefficient because is steps one bit at a time, but that was the easiest way to implement it without dealing with issues of endianess. I have not tested this either with a wide range of values, just the ones in the test. Also, there is no bounds checking and it is assumed the output buffers are long enough. So what I am saying is that this code is probably only good for educational purposes to get you started.

Jason
+1 for covering several options
RaphaelSP
A: 

I don't think you can avoid iterating across the elements. AFAIK Huffman encoding requires the frequencies of the "symbols", which unless you know the statistics of the "process" generating the integers, you will have to compute (by iterating across every element).

S.C. Madsen
Unless you work with a static huffman tree (eg predefined)
Ritsaert Hornstra
When the huffman tree is pre-defined, that means you already know the "statistics" of the generating process (as I wrote). Sorry if my explanation was unclear on this.
S.C. Madsen
A: 

If you have fixed sizes, e.g. you know your number is 38bit rather than 64, you can build structures using bit specifications. Amusing you also have smaller elements to fit in the remaining space.

struct example {
    /* 64bit number cut into 3 different sized sections */
    uint64_t big_num:38;
    uint64_t small_num:16;
    uint64_t itty_num:10;

    /* 8 bit number cut in two */
    uint8_t  nibble_A:4;
    uint8_t  nibble_B:4;
};

This isn't big/little endian safe without some hoop-jumping, so can only be used within a program rather than in a exported data format. It's quite often used to store boolean values in single bits without defining shifts and masks.

Steven
But these structures would use more space than the my `int[]`! The point is to save space by moving bits around (possibly) in place.
pajton
A: 

This was my solution, but it has been written for 32 bit architecture. I guess the changes to make it work on 64 bit are not that complicated though.

http://maz-programmersdiary.blogspot.com/2010/06/more-tools-to-write-better-c.html

Maz