views:

522

answers:

3

I have a text file that I want to send over the network, this file could vary in size from as low as 1KB to 500KB.
What algorithms/techniques could I use to tightly compress this file before sending it such that the least amount of bytes are send over the network and compression ratio is high?

+4  A: 

The answer depends on the content. GZip is included in the jdk. Tests on random strings seem to average 33% reduction in size.

[edit: content, not context]

Is GZIP browser dependant? I am not using a browser here only a self sufficient java client talking to a remote servlet.
Kevin Boyd
Java has the functionality built in (because of Jar files). It should always be available.
Bill K
Its a standard feature. see http://java.sun.com/javase/6/docs/api/java/util/zip/package-summary.html
+3  A: 

For compression, I'd consider gzip, bzip2 and LZMA (this is not an exhaustive list but these are IMO the most famous).

Then, I'd look for some benchmarks on the net and try to gather metrics for various files type (text, binary, mixed) and size (small, big, huge). Even if you're mostly interested by compression ratio, you might want to look at: the compression ratio, the compression time, the memory footprint, the decompression time.

According to A Quick Benchmark: Gzip vs. Bzip2 vs. LZMA:

[...] gzip is very fast and has small memory footprint. According to this benchmark, neither bzip2 nor lzma can compete with gzip in terms of speed or memory usage. bzip2 has notably better compression ratio than gzip, which has to be the reason for the popularity of bzip2; it is slower than gzip especially in decompression and uses more memory. However the memory requirements of bzip2 should be nowadays no problem even on older hardware.

[...]

LZMA clearly has potential to become the third commonly used general purporse compression format on *NIX systems. It mainly competes with bzip2 by offering significantly better compression ratio while still keeping decompressing speed relatively close to that of gzip.

This is confirmed in LZMA - better than bzip2:

The description is impressive, in short:

  • Better compression ratio (with best compression level when gzip achieves 38%, bzip2 34%, LZMA has 25%).
  • The compression is ratio gain is seen mainly on binary files.
  • Decompress time is much faster (3-4 times) than bzip2.
  • The algorithm allows to be executed in parallel (but the tool I'll describe here is one-thread).

There are also disadvantages:

  • Compression (excluding lower levels) is much slower than bzip2.
  • Memory requirements are much bigger during compression than bzip2.

So, for the compression of text files, the same site reports:

First thing I used LZMA for was compressing my mail archive. The spam file (mail in mbox format) I chose is 528MB big and I will use maximum compression ratio. During compression the lzma process was 370MB big, that's much :) bzip2 was below 7MB. It took almost 15 minutes to compress the file by lzma and less than 4 minutes by bzip2. Compression ration was very similar: output file is 373MB for bzip2 and 370MB for lzma. Decompression time is 1m12s for lzma and 1m48s for bzip2.

Finally, here is another resource with graphical results: Compression Tools: lzma, bzip2 & gzip

I'd really recommend to perform your own bench (as you'll be compressing text only and very small to small files) to get real metrics in your environment, but my bet is that LZMA won't provide a significant advantage on small text files so bzip2 would be a decent choice (even if the time and memory overhead of LZMA might be low on small files).

If you plan to perform the compression from Java, you'll find a LZMA implementation here, a bzip2 implementation here (coming from Apache Ant AFAIK), gzip being included in the JDK. If you don't want to or can't rely on a third party library, use gzip.

Pascal Thivent
A: 

It depends. Can you control the network packet size? Are you going to bundle them if more than 1 will fit in a packet? Are you limited by CPU on either end? Not really the question, but still related since it can take longer to compress & decompress than to send the bytes at times.

Wayne Young