
I have a program where I generate bitstreams, of about 80 to 150 bits or so, which I would like to compress, because I'm going to turn them into some kind of ASCII string so people can transmit them around.

Does anyone know of a good, free bit-aware compressor that might work on such a stream? My main problem with the "standard options" is that this stream should really be treated as bits, not bytes; otherwise the structure is lost, and the overhead of byte-oriented formats swamps any gain.

Addition:

The reason I want to compress these streams is because users are going to be cutting+pasting them, probably using something like base64 encoding, so saving some data is helpful.

Here is an example, for those who would like to see it. I'll add formatting to make it easier to read:

110 110 - This is a 6x6 grid (the maximum is 7x7, so we only need 3 bits per dimension!)

000000
011110
010010
010010
011110
000000 - This is one layout grid

000000
000000
001000
000100
000000
000000 - This is the second layout grid

Now we list some pieces

010 11111111 - A piece is a 3-bit colour code, then an 8-bit list of 'on / off' bits.
001 10101010 - Another piece!
001 10101010 - Another, identical piece!

The reason I say this should be considered 'as bits' is that there are obvious compression opportunities when it is viewed as a bitstream (in particular, the many 0s in the grids), which disappear when you consider it as a byte stream.
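
To make the format concrete, here is a rough Python sketch of the packing; the field layout is read straight off the example above, and the names are mine, not from any real spec:

    # Rough sketch of the packing described above; field names are
    # inferred from the example, not from a real spec.
    def pack_bits(width, height, grids, pieces):
        bits = []
        bits.append(format(width, "03b"))    # 3 bits: grid width (max 7)
        bits.append(format(height, "03b"))   # 3 bits: grid height (max 7)
        for grid in grids:                   # each grid: width*height cell bits
            bits.extend(grid)
        for colour, cells in pieces:         # each piece: 3-bit colour + 8 cell bits
            bits.append(format(colour, "03b"))
            bits.append(cells)
        return "".join(bits)

    grids = [
        ["000000", "011110", "010010", "010010", "011110", "000000"],
        ["000000", "000000", "001000", "000100", "000000", "000000"],
    ]
    pieces = [(0b010, "11111111"), (0b001, "10101010"), (0b001, "10101010")]
    stream = pack_bits(6, 6, grids, pieces)
    print(len(stream), "bits")  # 6 + 72 + 33 = 111 bits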

+3  A: 

I'd guess that no general purpose algorithm will give you great compression for this kind of data.

Your best bet is to analyze the structure of your data and try to find a custom compression algorithm or possibly customize an existing one (maybe with a pre-filled dictionary or something like that).
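
For example, zlib supports preset dictionaries; a rough Python sketch, with a made-up dictionary and payload:

    import zlib

    # Hypothetical preset dictionary seeded with bytes we expect to recur
    # (here, runs of zero bytes); both sides must use the same dictionary.
    shared_dict = b"\x00" * 32

    data = b"\x00\x00\x1e\x12\x12\x1e\x00\x00"  # made-up packed message

    comp = zlib.compressobj(level=9, zdict=shared_dict)
    packed = comp.compress(data) + comp.flush()

    decomp = zlib.decompressobj(zdict=shared_dict)
    assert decomp.decompress(packed) == data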

Joachim Sauer
A: 

The zlib library (which uses the same DEFLATE algorithm as gzip) is free. It has a few settings, but I am not sure how much you can save unless there is some periodic pattern to your bits.

Since PNG and GIF graphics files are essentially representations of bit patterns, perhaps you can look at the compression algorithms they use.

Dude, GIF and PNG are lossy. You don't want to use them for anything except images.
T.E.D.
I think you are thinking of jpeg compression. I'm pretty sure that png and gif are lossless.
KeithB
gif and png aren't lossy!
Nils Pipenbrinck
I looked it up. PNG is lossless, and GIF is lossless for 256 colors or less.
PNG has a lossless and a lossy version actually. Still, they are designed for images, not general data.
T.E.D.
BTW, the Deflate algorithm in PNG is used in .zip files.
devstuff
+2  A: 

Since the streams are so small, can you post some of them here?

Also are you sure that there is enough redundancy in those streams to even allow compression? Are there any repeating blocks of data?

It's kind of a long shot, but in the absence of any concrete answers, you might want to look into the ROM scene and check out how strings of text were compressed in cartridge-based RPG games like "Chrono Trigger" or "Final Fantasy III." I know that the text strings were compressed in those games (bytes were so precious in those days) and unraveling the scheme proved a fun challenge for hackers. That's the only thing that came to my mind when you mentioned lots of short little strings being compressed.

Your root problem might remain, though. I would imagine that the compression schemes in those ROMs exploited redundancy across many strings (i.e., if "Timbuktu" occurred in 58 different strings) and not so much within a single stream.

John Booty
A: 

What you want is lossless binary compression. I am sure there are papers and web articles, if not tons of other resources, out there. Google those terms and I suspect you will get what you need.

How much data are you talking about? Is your pipe small or the throughput so high that you have to compress?

In retrospect, your data is so small that you are probably not going to get worthwhile gains unless you analyze your traffic and do your own "compression" which is basically just a mapping/hash of known bit patterns.

As someone else said, post some sample data and you will probably get better advice after that.
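
As a rough sketch of that mapping idea (the patterns and codes here are invented):

    # Hypothetical codebook: map the 8-bit piece patterns that occur
    # most often to short codes, with an escape for everything else.
    COMMON = {"11111111": "00", "10101010": "01", "00000000": "10"}

    def encode_piece(cells):
        # "11" is an escape prefix meaning "a literal 8 bits follow".
        return COMMON.get(cells, "11" + cells)

    print(encode_piece("10101010"))  # '01' (2 bits instead of 8)
    print(encode_piece("11001100"))  # '1111001100' (escaped literal)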

Tim
+7  A: 

What are you hoping to accomplish by compressing 150 bits? Unless you aggregate several of these ~19-byte messages, I'm not sure what you hope to gain. Is it a UI issue, wherein you want your users to send/receive "codes"?

How about Base64 encoding? This will take binary data and turn it into coded characters for easy transmission or entry.
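
For example, a rough Python sketch that packs the sample header and first grid from the question and Base64-encodes the result:

    import base64

    # The 6-bit header plus the first layout grid from the question.
    bits = "110110" + "000000011110010010010010011110000000"
    padded = bits + "0" * (-len(bits) % 8)        # pad to a byte boundary
    raw = int(padded, 2).to_bytes(len(padded) // 8, "big")
    text = base64.b64encode(raw).decode("ascii")  # safe to cut and paste
    print(text)

    # Round trip: recover the original bits from the pasted text.
    back = bin(int.from_bytes(base64.b64decode(text), "big"))[2:].zfill(len(padded))
    assert back.startswith(bits)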

Michael Haren
+1 for base64 - which I think is what he really needs.
Draemon
A: 

I've had the same thought as Tim - such a small amount of data barely seems worth compressing. As a matter of fact, I'd suggest that what you really want to look into is some sort of ASCII encoding method, like uuencode or MIME encoding (aka Base64).

Paul Tomblin
A: 

CCITT's Group 3 and Group 4 lossless encoding schemes, used in compressing G3 and G4 TIFFs, were designed with binary data in mind. G4 TIFFs are black-and-white images commonly used for OCR and faxing. Another simple scheme that comes to mind is RLE.

codelogic
+1  A: 

I would suggest you look into using zlib. It is downloadable, and the license allows you to use it for pretty much any project. An important point is that it is widely used, and thus well debugged. If your data is important, you don't want to be debugging odd edge cases in a homebrew algorithm at random dates in the future.

I've messed around with it a bit, and it does allow stream-oriented compression. I'm not really sure how good it is when you are just feeding it a small amount of data at a time, though. Lossless compression tends to work by finding and eliminating patterns, and there won't be a lot of patterns to find if you are feeding it something small like 12 bytes at a time.
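
A rough sketch of that small-data problem, using Python's zlib binding on an arbitrary 19-byte payload:

    import zlib

    # A made-up ~19-byte message (about 150 bits of packed data).
    payload = bytes.fromhex("d801e492491e000000201000005ff52aa55400")
    out = zlib.compress(payload, 9)
    # The zlib wrapper alone costs 6 bytes (2-byte header + 4-byte
    # Adler-32 checksum) before any compression gain, which is an
    # enormous overhead at this message size.
    print(len(payload), "->", len(out), "bytes")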

I'm not voting Juan's answer up because he also suggests using GIF, which is a lossy compression. You didn't give a lot of info, but I'm guessing you don't want any compression format that actually loses data. Most popular graphic, audio, and video compression algorithms are lossy; they rely on the ability of human senses to take in an image or sound properly with some of the original information removed or modified slightly.

T.E.D.
GIF is lossless for 256 colors or less.
+3  A: 

Chris, thanks for posting those samples. I think run-length encoding is the way you want to go. That should be pretty trivial to implement.

http://en.wikipedia.org/wiki/Run-length_encoding

It will work well with all those consecutive 0s.
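
A rough sketch of the idea (Python; the 3-bit run-length field is an arbitrary choice):

    from itertools import groupby

    def rle_encode(bits):
        # Emit (bit, 3-bit run length) pairs; runs longer than 7 are
        # split. Dense areas can expand under naive RLE, so tune the
        # length field to your data.
        out = []
        for bit, run in groupby(bits):
            n = len(list(run))
            while n > 0:
                take = min(n, 7)
                out.append(bit + format(take, "03b"))
                n -= take
        return "".join(out)

    grid = "000000000000001000000100000000000000"  # the sparse second grid
    print(len(grid), "->", len(rle_encode(grid)), "bits")  # 36 -> 28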

So the primary reason to compress these strings is to make them easier to cut and paste? Makes sense; that sounds like an interesting project.

If you are just trying to make the strings more human-manageable, it sounds like you're all set. If you are trying to compress them so that they transmit faster over the wire, I think the benefit of compressing small strings may be defeated by other TCP issues like MTU sizes and all that. (I'm not experienced there, so take that last bit with many grains of salt.)

Good luck!

John Booty
A: 

JBIG might give you what you need.

http://en.wikipedia.org/wiki/JBIG

http://www.jpeg.org/jbig/index.html

http://www.cl.cam.ac.uk/~mgk25/jbigkit/

JBIG is used to compress 1-bpp fax images.

David Poole
A: 

Just to add to what's already been said, isn't "compressing a small amount of data" intrinsically a bit pointless? If you could elaborate on the data, the platform or the uses that might help.

As for the bits vs ascii - I'm not entirely sure what you're getting at, but as mentioned by Michael, Base64 provides a way to make arbitrary binary more friendly.

Note that any conversion from binary into ASCII is the opposite of compression.

Draemon
A: 

My first suggestion is that you look into range encoding. Instead of

1. compressing the bit data into binary data, and then

2. encoding the binary data into Base64 ASCII data,

you could pack your bits directly into the range 0 to N-1 (where N is the number of printable characters you are using) and then do a dead-easy mapping.
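
A rough sketch of that direct mapping (Python; the 64-character alphabet is just an example, and N need not be a power of two):

    # Hypothetical alphabet; any agreed-upon set of printable characters
    # works, and its size N does not have to be a power of two.
    ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+-"
    N = len(ALPHABET)

    def bits_to_text(bits):
        # Treat the bit string as one big integer in base N. A sentinel
        # '1' is prepended so leading zero bits survive the round trip.
        value = int("1" + bits, 2)
        out = []
        while value:
            value, digit = divmod(value, N)
            out.append(ALPHABET[digit])
        return "".join(reversed(out))

    def text_to_bits(text):
        value = 0
        for ch in text:
            value = value * N + ALPHABET.index(ch)
        return bin(value)[3:]  # strip '0b' and the sentinel bit

    bits = "110110000000011110010010"
    assert text_to_bits(bits_to_text(bits)) == bits
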

My second suggestion is that you study the filter methods employed by PNG and think about whether similar methods could be used to render your data more compressible. It's difficult to tell from just two sample layout grids, but it seems very likely from your first grid that some method such as "predict each pixel based on its neighbors above and to the left, and then convert each pixel to 0 if it met its prediction and 1 if it defied its prediction" could give you a much more uniform set of data, and thus greater compression.