ansaurus

Question

Answer 1

+2 A:

I know this might seem like the obvious thing to say, and I'm sure there's actually a solution, but why not use a smaller type, like uint8_t (max 255) or uint16_t (max 65535)? I'm sure you could bit-manipulate an int64_t using defined masks, OR operations and the like, but, aside from an academic exercise, why?

And on the note of academic exercises, Bit Twiddling Hacks is a good read.

Ninefingers 2010-03-07 19:57:15

+1 for cool link. Well, this can sometimes be an int64_t with, say, 49 useful bits. So using a smaller type is not an option.

pajton 2010-03-07 21:51:07

Answer 2

+5 A:

Almost any compression algorithm, for example Huffman coding, will get close to the minimum number of bits (the entropy) needed to encode the integers, but accessing the result like an array will be non-trivial.
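For comparison, plain fixed-width bit packing (every value stored in the same number of bits n) does keep array-style access simple, because element i starts at bit i*n. A minimal sketch, assuming values are packed most significant bit first; `get_packed` is a hypothetical helper, not part of any library:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read element `index` from a buffer of n-bit values packed MSB first.
   Assumes 1 <= n <= 64 and that the buffer is long enough (no bounds check). */
static uint64_t get_packed(const unsigned char *buf, size_t index, int n)
{
    assert(n >= 1 && n <= 64);
    size_t bit = index * (size_t)n;        /* position of the element's first bit */
    uint64_t value = 0;
    for (int k = 0; k < n; k++, bit++) {
        /* pick bit (7 - bit % 8) of the byte that holds this position */
        unsigned b = (buf[bit / 8] >> (7 - bit % 8)) & 1u;
        value = (value << 1) | b;
    }
    return value;
}
```

With a Huffman-coded stream there is no such formula, since the position of element i depends on the code lengths of everything before it.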

keraba 2010-03-07 20:56:42

Interesting idea, +1.

Ninefingers 2010-03-07 22:21:54

The point is I'd like to write it later to a file, so I need to bitpack it first to save disk space.

pajton 2010-03-07 22:25:00

If you want to minimize disk usage, you should look for a compression library instead of rolling your own.

Georg Fritzsche 2010-03-07 23:23:06

Well, I am actually sort of rolling my own, hence the question:).

pajton 2010-03-07 23:44:58

Answer 3

+3 A:

I agree with keraba that you need to use something like Huffman coding or perhaps the Lempel-Ziv-Welch algorithm. The problem with bit-packing the way you are talking about is that you have two options:

1. Pick a constant n such that the largest integer can be represented.
2. Allow n to vary from value to value.

The first option is relatively easy to implement, but it wastes a lot of space whenever most values need far fewer bits than the largest one.
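For the first option, the constant n can be found with one pass over the data. A minimal sketch, assuming the values are non-negative (the question does not say so); `bits_needed` and `min_width` are hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bits needed to represent v (at least 1, so that 0 still takes a slot). */
static int bits_needed(uint64_t v)
{
    int n = 1;
    while (v >>= 1)
        n++;
    return n;
}

/* Smallest constant width n that fits every element of the array. */
static int min_width(const int64_t *values, size_t count)
{
    uint64_t all = 0;
    for (size_t i = 0; i < count; i++) {
        assert(values[i] >= 0);          /* assumption: no negative values */
        all |= (uint64_t)values[i];      /* OR keeps the highest set bit seen so far */
    }
    return bits_needed(all);
}
```

For values in the range 0 to 20 this gives n = 5. A single large outlier raises n for every element, which is where the wasted space comes from.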

The second option has the major disadvantage that you have to convey changes in n somehow in the output bitstream. For instance, each value will have to have a length associated with it. This means you are storing two integers (albeit smaller integers) for every input value. There's a good chance you'll increase the file size with this method.
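The overhead of the second option can be estimated before committing to it. A rough sketch, assuming each value is stored as a 6-bit length field (widths 1 to 64 encoded as 0 to 63) followed by that many value bits; the 6-bit header and the function names are illustrative choices, not something from the question:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bits needed for v, with a minimum of 1. */
static int width_of(uint64_t v)
{
    int n = 1;
    while (v >>= 1)
        n++;
    return n;
}

/* Total bits when every value uses the same width n. */
static uint64_t fixed_cost(size_t count, int n)
{
    assert(n >= 1 && n <= 64);
    return (uint64_t)count * (uint64_t)n;
}

/* Total bits when every value carries its own 6-bit length prefix. */
static uint64_t prefixed_cost(const uint64_t *values, size_t count)
{
    uint64_t bits = 0;
    for (size_t i = 0; i < count; i++)
        bits += 6 + (uint64_t)width_of(values[i]);
    return bits;
}
```

For {0, 1, 2, 3} the fixed layout with n = 2 takes 8 bits while the prefixed layout takes 30, so the file grows. For skewed data such as {1, 1, 1, 2^40} the prefixed layout takes 68 bits against 164 for fixed n = 41, so which option wins depends on the distribution of the values.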

The advantage of Huffman or LZW is that they create codebooks in such a way that the length of the codes can be derived from the output bitstream without actually storing the lengths. These techniques allow you to get very close to the Shannon limit.

I decided to give your original idea (constant n, remove unused bits and pack) a try for fun and here is the naive implementation I came up with:

#include <stdint.h>
#include <stdio.h>

/* Pack the low n bits of each input value, most significant bit first.
   Returns the number of bits written. No bounds checking is done. */
int pack(int64_t* input, int nin, void* output, int n)
{
    unsigned char* pout = (unsigned char*)output;
    int obit = 0;   /* next free bit in *pout, 0 is the MSB */
    int nout = 0;   /* total number of bits written */
    *pout = 0;

    for(int i=0; i<nin; i++)
    {
        /* work on an unsigned copy so the shifts are well defined, even for n = 64 */
        uint64_t value = (uint64_t)input[i];
        for(int k=0; k<n; k++)
        {
            if(obit>7)
            {
                obit = 0;
                pout++;
                *pout = 0;
            }
            /* bit (n-k-1) of the value goes to bit (7-obit) of the current byte */
            *pout |= (unsigned char)(((value >> (n-k-1)) & 1) << (7-obit));
            obit++;
            nout++;
        }
    }
    return nout;
}

/* Unpack nbitsin bits into n-bit values. Values are not sign-extended.
   Returns the number of values written. */
int unpack(void* input, int nbitsin, int64_t* output, int n)
{
    unsigned char* pin = (unsigned char*)input;
    int64_t* pout = output;
    int nbits = nbitsin;
    int inbit = 0;  /* next bit to read from *pin, 0 is the MSB */
    int nout = 0;
    while(nbits > 0)
    {
        uint64_t value = 0;
        for(int i=0; i<n; i++)
        {
            if(inbit > 7)
            {
                pin++;
                inbit = 0;
            }
            value |= (uint64_t)((*pin >> (7-inbit)) & 1) << (n-i-1);
            inbit++;
        }
        *pout++ = (int64_t)value;
        nbits -= n;
        nout++;
    }
    return nout;
}

int main()
{
    int64_t input[] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20};
    int64_t output[21];
    unsigned char compressed[21*8];
    int n = 5;  /* 5 bits are enough for values up to 31 */

    int nbits = pack(input, 21, compressed, n);
    int nout = unpack(compressed, nbits, output, n);

    for(int i=0; i<nout; i++)
        printf("input: %lld   output: %lld\n", (long long)input[i], (long long)output[i]);
    return 0;
}

This is very inefficient because it steps one bit at a time, but that was the easiest way to implement it without dealing with endianness issues. I have not tested it with a wide range of values either, just the ones in the test. Also, there is no bounds checking, and the output buffers are assumed to be long enough. So this code is probably only good for educational purposes, to get you started.
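One way to avoid the bit-at-a-time loop is to shift whole values into a 64-bit accumulator and flush it a byte at a time. The following is only a sketch of that idea, not a tuned implementation; it assumes 1 <= n <= 56 so the accumulator cannot overflow, and `pack_fast` is a hypothetical name. It produces the same MSB-first byte layout as the naive version and is likewise free of endianness issues, because bytes are written one by one:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack the low n bits of each value, MSB first, and return the bytes written.
   The caller must provide an output buffer of at least (count*n + 7) / 8 bytes. */
static size_t pack_fast(const uint64_t *in, size_t count, int n, unsigned char *out)
{
    assert(n >= 1 && n <= 56);            /* at most 7 pending bits plus n new ones */
    const uint64_t mask = ((uint64_t)1 << n) - 1;
    uint64_t acc = 0;                     /* pending bits, right-aligned */
    int pending = 0;                      /* number of valid bits in acc */
    size_t written = 0;

    for (size_t i = 0; i < count; i++) {
        acc = (acc << n) | (in[i] & mask);
        pending += n;
        while (pending >= 8) {            /* flush every complete byte */
            pending -= 8;
            out[written++] = (unsigned char)(acc >> pending);
        }
    }
    if (pending > 0)                      /* pad the last partial byte with zero bits */
        out[written++] = (unsigned char)(acc << (8 - pending));
    return written;
}
```

Packing {1, 2, 3} with n = 5 yields the two bytes 0x08 and 0x86.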

Jason 2010-03-08 16:53:58

+1 for covering several options

RaphaelSP 2010-03-08 20:57:01

Answer 4

A:

I don't think you can avoid iterating across the elements. AFAIK Huffman encoding requires the frequencies of the "symbols". Unless you already know the statistics of the "process" generating the integers, you will have to compute them by iterating across every element.
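That counting pass is short. A minimal sketch, assuming each byte of the data is treated as one symbol (one possible choice; the symbols could just as well be whole integers); `count_frequencies` is a hypothetical name:

```c
#include <assert.h>
#include <stddef.h>

/* One pass over the data, counting how often each byte value occurs.
   The resulting table is the input a Huffman tree builder would need. */
static void count_frequencies(const unsigned char *data, size_t len, size_t freq[256])
{
    assert(data != NULL || len == 0);
    for (int s = 0; s < 256; s++)
        freq[s] = 0;
    for (size_t i = 0; i < len; i++)
        freq[data[i]]++;
}
```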

S.C. Madsen 2010-03-08 20:53:16

Unless you work with a static (i.e. predefined) Huffman tree.

Ritsaert Hornstra 2010-03-09 11:09:32

When the huffman tree is pre-defined, that means you already know the "statistics" of the generating process (as I wrote). Sorry if my explanation was unclear on this.

S.C. Madsen 2010-03-09 12:17:23

Answer 5

A:

If you have fixed sizes, e.g. you know your number is 38 bits rather than 64, you can build structures using bit-field specifications, assuming you also have smaller elements to fit in the remaining space.

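#include <stdint.h>

struct example {
    /* 64-bit number cut into 3 different-sized sections.
       Bit-fields of types other than int, unsigned int and _Bool are
       implementation-defined in C, though common compilers accept them. */
    uint64_t big_num:38;
    uint64_t small_num:16;
    uint64_t itty_num:10;

    /* 8-bit number cut in two */
    uint8_t  nibble_A:4;
    uint8_t  nibble_B:4;
};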

This isn't big/little-endian safe without some hoop-jumping (the bit-field layout is also compiler-specific), so it can only be used within a program rather than in an exported data format. It's quite often used to store boolean values in single bits without defining shifts and masks.
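For an exported format, the same 38/16/10 split can be done with explicit shifts and masks and a fixed byte order. A sketch under those assumptions; the function names and the choice of most-significant-byte-first order are illustrative, not part of the struct example:

```c
#include <assert.h>
#include <stdint.h>

/* Pack the three fields into one 64-bit word: big_num in the top 38 bits,
   small_num in the next 16, itty_num in the low 10. */
static uint64_t pack_fields(uint64_t big_num, uint16_t small_num, uint16_t itty_num)
{
    assert(big_num < ((uint64_t)1 << 38));
    assert(itty_num < (1u << 10));
    return (big_num << 26) | ((uint64_t)small_num << 10) | itty_num;
}

/* Store a 64-bit word as 8 bytes, most significant byte first,
   so the result is the same on any host. */
static void store_be64(uint64_t word, unsigned char out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = (unsigned char)(word >> (56 - 8 * i));
}

/* Inverse of store_be64. */
static uint64_t load_be64(const unsigned char in[8])
{
    uint64_t word = 0;
    for (int i = 0; i < 8; i++)
        word = (word << 8) | in[i];
    return word;
}
```

Reading a field back is the mirror image: `word >> 26` for big_num, `(word >> 10) & 0xFFFF` for small_num and `word & 0x3FF` for itty_num.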

Steven 2010-03-13 13:15:59

But these structures would use more space than my `int[]`! The point is to save space by moving bits around, possibly in place.

pajton 2010-03-13 15:18:56

Answer 6

A:

This was my solution, but it was written for a 32-bit architecture. I guess the changes needed to make it work on 64-bit are not that complicated, though.

http://maz-programmersdiary.blogspot.com/2010/06/more-tools-to-write-better-c.html

Maz 2010-09-18 09:29:02


Bit packing of array of integers
