views:

94

answers:

3

Which compression method in Python has the best compression ratio?

Is the commonly used zlib.compress() the best, or are there better options? I need to get the best compression ratio possible.

I am compressing strings and sending them over UDP. A typical string I compress is about 1,700,000 bytes long.

+4  A: 

There may be some more obscure formats with better compression, but LZMA is the best of those that are well supported. There are some Python bindings here.

EDIT

Don't pick a format without testing; some algorithms do better than others depending on the data set.
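On Python 3.3+, lzma is available in the standard library alongside zlib and bz2, so testing is easy. A minimal sketch for comparing ratios, using a synthetic repetitive payload as a stand-in for your own data:

```python
import bz2
import lzma
import zlib

# Stand-in payload (~1.7 MB, repetitive); substitute your real data here.
data = bytes(range(256)) * 6800

for name, compress in [
    ("zlib", lambda d: zlib.compress(d, 9)),
    ("bz2", lambda d: bz2.compress(d, 9)),
    ("lzma", lambda d: lzma.compress(d, preset=9)),
]:
    out = compress(data)
    print(f"{name}: {len(out)} bytes ({len(out) / len(data):.1%} of original)")
```

The ranking can differ on your real images, which is exactly why the test matters.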

mikerobi
My data set is a long string representing a 640x480 image.
Richard Knop
If you were already using an image format with a good compression algorithm, would it be necessary to compress the whole thing again?
joni
@joni The image was not really that well compressed. zlib.compress() shrank it to about 30% of the original size, but I am looking for even more drastic compression.
Richard Knop
@Richard Knop, in that case you might get better compression by converting the images to a compressed image format. If it has to be lossless compression, I would try PNG. PNG uses zlib compression, which is less efficient than LZMA, but it does some pre-filtering which will likely give a better overall result. In theory you could replace the zlib compression in PNG with LZMA, but that isn't something you can just casually do in Python.
mikerobi
using a compressed image format like PNG was what I meant, too.
joni
@mikerobi: Why should it be lossless compression? Lossy compression will usually give much better results if it's compatible with the problem.
kriss
@kriss, that is why I qualified my comment with "if it has to be lossless"; I was just guessing, based on the OP, that lossy is not what he wanted.
mikerobi
+2  A: 

If you are willing to trade performance for better compression, the bz2 library usually gives better results than the gz (zlib) library.

There are other compression libraries, like xz (LZMA2), that might give even better results, but they do not appear to be in the core distribution of Python.

Python Doc for BZ2 class
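A minimal bz2 round-trip, using a stand-in payload (swap in your own data):

```python
import bz2

# Stand-in payload; substitute the real data to be sent over UDP.
payload = b"example payload " * 1000

compressed = bz2.compress(payload, compresslevel=9)  # 9 = best ratio, slowest
restored = bz2.decompress(compressed)
print(len(payload), "->", len(compressed), "bytes")
```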

EDIT: Depending on the type of image you might not get much additional compression. Many image formats are already compressed, unless the source is raw, BMP, or uncompressed TIFF. Testing between various compression types is highly recommended.

EDIT2: If you do decide to do image compression, ImageMagick supports Python bindings and many image conversion types.

ImageMagick

Image Formats Supported

CtRanger
It's a raw image, not compressed. zlib.compress() shrank it to 30% of its size.
Richard Knop
Since it is a raw image, the LZMA binding should do a little better than the BZ2 library. As suggested above, you should also be able to use lossless image compression with good or better results.
CtRanger
@CtRanger: you mean *lossy*, not *lossless*, don't you?
kriss
@kriss: It would depend on the image compression algorithm. I think that PNG would still do better than just a block compression on raw data.
CtRanger
@CtRanger: any compression scheme that is pixel-aware and image-size-aware is probably better than a general-purpose compression algorithm. That's true for PNG. But you can get even better results if some data loss (even invisible to the eye, as with high-quality JPEG) can be afforded.
kriss
I totally agree with that statement. There might be some requirement on the data that prevents any sort of lossy compression, since they were originally looking at raw + block compression. If there is no such requirement, then a slightly lossy algorithm would be the best choice.
CtRanger
@CtRanger: I checked the compression method used by PNG. As a matter of fact, it uses DEFLATE (the same algorithm as zlib). That means it's probably not taking the image size into account, just the pixel size (each "character" being a pixel).
kriss
@kriss: This is true. The big difference is that PNG uses image-aware pre-filters to optimize the data before the block compression stage. Depending on the details of the image, this can be much better than DEFLATE alone, or just about the same.
CtRanger
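The pre-filtering idea discussed above can be sketched outside PNG. A rough illustration, using synthetic gradient data as a stand-in for real scanlines, of a PNG-style "Sub" filter applied before DEFLATE:

```python
import zlib

# Synthetic smooth-gradient "scanline" data, a crude stand-in for photo rows.
row = bytes((i * 255) // 4095 for i in range(4096))
image = row * 100

def sub_filter(data: bytes) -> bytes:
    """PNG-style 'Sub' filter: store each byte as a delta from its left neighbour."""
    return bytes([data[0]] + [(data[i] - data[i - 1]) % 256
                              for i in range(1, len(data))])

plain = zlib.compress(image, 9)
prefiltered = zlib.compress(sub_filter(image), 9)
print("plain:", len(plain), "pre-filtered:", len(prefiltered))
```

Which variant wins depends on the data; on smoothly varying pixels the deltas cluster near zero, which is exactly the kind of input DEFLATE likes.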
+1  A: 

If you are dealing with images, you should definitely choose a lossy compression format (i.e., a pixel-aware one) in preference to any lossless one. That will give you much better results. Recompressing a lossy format with a lossless one is a waste of time.

I would look through PIL to see what is available. Something like converting the image to JPEG, with a compression ratio matched to the quality you need, before sending should be very efficient.

You should also be very cautious when using UDP: it can lose packets, and most compression formats are very sensitive to missing parts of a file. That can, however, be managed at the application level.
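A minimal sketch of the PIL route, assuming the Pillow fork of PIL and using a hypothetical 640x480 8-bit grayscale frame as a stand-in for the real pixel data:

```python
import io

from PIL import Image  # Pillow; `pip install Pillow` if needed

# Hypothetical raw 640x480 8-bit grayscale frame; substitute real pixel data
# (and the right mode, e.g. "RGB" for 24-bit colour).
raw = bytes(640 * 480)
img = Image.frombytes("L", (640, 480), raw)

buf = io.BytesIO()
img.save(buf, format="JPEG", quality=85)  # lower quality -> smaller payload
payload = buf.getvalue()
print(len(payload), "bytes to send over UDP")
```

The `quality` parameter is the knob to tune against the acceptable loss.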

kriss
Which lossy compression format do you recommend?
Richard Knop
JPEG is a good lossy image format, PNG is a good lossless image format. I wouldn't say lossy is always preferred, it really depends on the data. Lossy is preferred for noisy images; photos, scans, etc. Lossless works well for graphs, line art, etc.
adw
@adw: I agree with you that PNG and JPEG are both good formats, but if you take compression ratio into account, JPEG is much better. I checked PNG compression and it just uses DEFLATE (the same algorithm as used in zlib).
kriss
@RichardKnop: the actual image format shouldn't change the architecture of the program much. The best format may depend on the type of image you're managing, the number of colors, whether some loss of quality is acceptable, etc. My guess is that JPEG is probably a good candidate (it is *not* the best, as, say, JPEG 2000 or some fractal-based algorithms give better compression, but it's very well supported and likely to give good results).
kriss
@kriss: As I said, it depends on the data. For some kinds of data PNG is much better (both in visual quality and compression ratio). PNG doesn't just use DEFLATE; it uses prediction filters to (often greatly) improve the compression ratio (e.g. the Paeth predictor).
adw