views: 214

answers: 6
I have a binary array in C and I want to compress it. Could you suggest an algorithm that compresses a binary array? I have tried the Lempel–Ziv–Welch (LZW) algorithm, but it is not suitable for me because there is no repetition in my data.

A: 

If you have binary data, you will most likely treat it as something like a char[]. In your question and a comment you state that there is (almost) no repetition, which is only possible if you do not have more than 256 (char) data items.

But I guess you have more data, and so compression is possible. If the frequencies of your data items are not evenly distributed, you may have some luck with simple Huffman coding (see the sketch below).

To give you more precise advice we need more details about the kind of data you want to compress.
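To make "not evenly distributed" concrete, here is a minimal sketch (my addition, not part of the original answer) that estimates the zero-order entropy of a buffer in C. A result close to 8 bits per byte means plain Huffman coding will gain very little:

/* Zero-order Shannon entropy in bits per byte (0.0 .. 8.0).
   Build with: cc entropy.c -lm */
#include <math.h>
#include <stdio.h>

static double entropy_bits_per_byte(const unsigned char *buf, size_t len)
{
    size_t hist[256] = {0};
    double h = 0.0;

    if (len == 0)
        return 0.0;
    for (size_t i = 0; i < len; i++)
        hist[buf[i]]++;
    for (int s = 0; s < 256; s++) {
        if (hist[s] == 0)
            continue;
        double p = (double)hist[s] / (double)len;
        h -= p * log2(p);                 /* -sum p * log2(p) */
    }
    return h;
}

int main(void)
{
    unsigned char sample[] = "aaaabbbbccdd";   /* toy input */
    printf("%.3f bits/byte\n",
           entropy_bits_per_byte(sample, sizeof sample - 1));  /* ~1.918 */
    return 0;
}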

Frank Bollack
@Frank, I have an unsigned char[] into which I have stuffed the data. The size of the array is 1.66 KB, and I want to transfer it over a serial port within 4 seconds, so I want to compress the data.
Arman
Then take that data and run it through some of the popular compression libraries (try zlib as a start) and see which performs best. Just because you don't see a pattern in the data doesn't mean it cannot be compressed by a certain algorithm.
nos
+1  A: 

You may have no repetition, but there could still be a pattern in the data that can be taken advantage of. Exploiting it requires knowing more about the data than just that there is no repetition, though.

If your data is actually (or nearly) randomly distributed, then compressing it runs into the pigeonhole principle. It states that if you only have X pigeons and Y holes to put them in, and X > Y, then you don't have enough room. Applied to compression: there are 2^n possible inputs of n bits but only 2^n - 1 outputs shorter than n bits, so no lossless scheme can shrink every input. Compression only wins when it can avoid storing some pigeons because they are identical twins of one already in a hole, leaving a note telling the decompression algorithm to clone that pigeon. In Huffman coding, all pigeons are clones of pigeons in the pigeon library. In several other compression schemes, some pigeons are mega-pigeons made up of other pigeons.

nategoose
@Nategoose, I do not understand what you mean by a pattern in the data; the data is actually coming from another source.
Arman
@Arman: If, by examining an initial chunk of data, you can predict the next chunks with better than random probability, then you have a pattern that can be exploited by a compression algorithm (possibly a new one that you have to invent). Taken to the extreme, this could mean sending the code of a pseudo-random number generator and its seed rather than the stream of its output; less extreme would be sending a chunk of data followed by a symbol that tells you whether the next chunk is included inline or is X of Y guesses.
nategoose
@Arman: This is just a generalization of the idea that repeating-data compression takes advantage of. There, you use the fact that the next data is exactly some previous data; in what I just described, the next data would be some function of previous data rather than an exact copy.
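To make that concrete with a hypothetical example (my addition): if successive bytes tend to be close in value, delta encoding replaces each byte with its difference from the previous one, turning slowly varying data into runs of small values that a later entropy coder can exploit. A minimal sketch:

#include <stdio.h>

/* Replace each byte with its difference from the previous byte
   (mod 256). Slowly varying data becomes runs of small values. */
static void delta_encode(unsigned char *buf, size_t len)
{
    unsigned char prev = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char cur = buf[i];
        buf[i] = (unsigned char)(cur - prev);
        prev = cur;
    }
}

/* Inverse transform: a running sum restores the original bytes. */
static void delta_decode(unsigned char *buf, size_t len)
{
    unsigned char prev = 0;
    for (size_t i = 0; i < len; i++) {
        prev = (unsigned char)(prev + buf[i]);
        buf[i] = prev;
    }
}

int main(void)
{
    unsigned char data[] = {10, 12, 13, 13, 15, 18};
    delta_encode(data, sizeof data);   /* -> 10, 2, 1, 0, 2, 3 */
    delta_decode(data, sizeof data);   /* -> 10, 12, 13, 13, 15, 18 */
    for (size_t i = 0; i < sizeof data; i++)
        printf("%u ", data[i]);
    printf("\n");
    return 0;
}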
nategoose
+2  A: 

Why not just use zlib's deflate? As an added bonus, zlib is available on pretty much every existing platform.

Or the newer LZMA? It beats bzip2 at binary data compression.
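For reference, a minimal sketch of zlib's one-shot compress() API; the payload here is a placeholder standing in for the 1.66 KB buffer from the question:

/* Minimal one-shot deflate with zlib. Build with: cc deflate_demo.c -lz */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    unsigned char src[1700];            /* ~1.66 KB, as in the question */
    unsigned char dst[4096];            /* >= compressBound(sizeof src) */
    uLongf dst_len = sizeof dst;

    memset(src, 0xA5, sizeof src);      /* placeholder payload */

    /* compress() returns Z_OK on success and updates dst_len
       to the compressed size. */
    if (compress(dst, &dst_len, src, sizeof src) != Z_OK) {
        fprintf(stderr, "compress failed\n");
        return 1;
    }
    printf("%zu -> %lu bytes\n", sizeof src, (unsigned long)dst_len);
    return 0;
}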

Dummy00001
LZMA is very good; I am using it to compress data buffers.
Sergius
@Sergius, is there any repetition in your data? Or does your data have some pattern?
Arman
+1  A: 

You can cut the space in half easily!

Since your binary data has NO repetition, your only options are [0, 1] and [1, 0]. Anything longer would repeat either a zero or a one. Therefore, you can just represent the first sequence with a 0 and the second with a 1. Encoding would look something like this...

encode [0, 1] = 0
encode [1, 0] = 1

And decoding would be...

decode 0 = [0, 1]
decode 1 = [1, 0]

Sorry for the Haskell syntax; it's just so much more readable in this case. This turns your two-element array into a one-element array, which can be stored in half the space! Magic.

EDIT: This ignores the trivial case of [0] and [1]. If those need to be handled (although you shouldn't really be compressing 1 bit), it is impossible to get a better compression ratio than 100%.

Clark Gaebel
@Clark, if my actual data is [1, 1], then it will come out as [1, 0, 1, 0] at the decoding end.
Arman
I have at most four possibilities: [1, 0], [0, 1], [1, 1], [0, 0].
Arman
You can't have [1, 1]. The rules specify that there are NO repetitions. If you have [1, 1], you have one repetition of 1.
Clark Gaebel
A: 

Alternatively: your binary data represents certain values. You could reduce the bit count of all values. You need to know the possible range, and then write and read the data bitwise. This can save a lot of space if, for example, you store values in a uint32 that only need a few bits.
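A minimal sketch of this idea (the 10-bit field width and the sample values are hypothetical):

/* Pack values bitwise instead of one per uint32. */
#include <stdint.h>
#include <stdio.h>

/* Append the low `bits` bits of `value` to `out`, MSB first.
   `*bitpos` tracks the running bit offset; `out` must be zeroed. */
static void put_bits(uint8_t *out, size_t *bitpos, uint32_t value, int bits)
{
    for (int i = bits - 1; i >= 0; i--) {
        if ((value >> i) & 1u)
            out[*bitpos / 8] |= (uint8_t)(0x80u >> (*bitpos % 8));
        (*bitpos)++;
    }
}

int main(void)
{
    /* Hypothetical readings in 0..1023: 10 bits each instead of 32,
       so four values need 5 bytes instead of 16. */
    uint32_t values[] = {1023, 0, 512, 300};
    uint8_t packed[8] = {0};
    size_t bitpos = 0;

    for (int i = 0; i < 4; i++)
        put_bits(packed, &bitpos, values[i], 10);

    printf("packed 4 values into %zu bytes\n", (bitpos + 7) / 8);
    return 0;
}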

Code Clown
@Code, I have already packed all the bits according to the range of the values; there is no room left to reduce the size.
Arman
@Arman: Okay, so your data is dense, has high entropy and no repetition. You could try a Huffman encoding to see if compression is useful at all.
Code Clown
A: 

Compression is not magic. If your data is completely random, there is no compression algorithm available which can make it smaller.

Most data is not completely random, but it is up to you to discover the optimum way to express it so that the patterns can be detected. Images and sound are common enough that standard algorithms have been developed, but no more can be said about your specific problem without getting many more details.

Mark Ransom