What is the current state of text-only compression algorithms? | ansaurus

+5 Q:

What is the current state of text-only compression algorithms?

In honor of the Hutter Prize, what are the top algorithms (and a quick description of each) for text compression?

Note: The intent of this question is to get a description of compression algorithms, not of compression programs.

+2 A:

There's always lzip.

All kidding aside:

Where compatibility is a concern, PKZIP (DEFLATE algorithm) still wins.
bzip2 is the best compromise between a relatively broad install base and a rather good compression ratio, but it only compresses a single stream, so it needs a separate archiver (such as tar).
7-Zip (LZMA algorithm) compresses very well and is available under the LGPL. Few operating systems ship with built-in support, however.
rzip, which adds a long-range redundancy pass in front of bzip2, deserves more attention in my opinion. It could be particularly interesting for huge log files that need long-term archiving. It also requires a separate archiver.
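
For a rough feel of how the three algorithm families above compare on text, here is a minimal sketch using Python's standard library. It assumes the `zlib`, `bz2` and `lzma` modules as stand-ins for DEFLATE, bzip2 and LZMA; their container formats differ from PKZIP and 7-Zip archives, and the `compressed_sizes` helper and `SAMPLE` text are made up for illustration.

```python
import bz2
import lzma
import zlib

# Hypothetical sample input: any reasonably long text will do.
SAMPLE = ("The quick brown fox jumps over the lazy dog. " * 200).encode("ascii")


def compressed_sizes(data: bytes) -> dict:
    """Return the output size in bytes for each stdlib compressor at its highest level."""
    return {
        "raw": len(data),
        "DEFLATE (zlib)": len(zlib.compress(data, 9)),
        "bzip2": len(bz2.compress(data, 9)),
        "LZMA (xz)": len(lzma.compress(data, preset=9)),
    }


if __name__ == "__main__":
    for name, size in compressed_sizes(SAMPLE).items():
        print(f"{name:>15}: {size} bytes")
```

The ranking depends heavily on the input, so run it on your own files before drawing conclusions.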

Sören Kuklau 2008-10-25 14:29:44

These come nowhere near PAQ and several other text-only compression algorithms (http://en.wikipedia.org/wiki/PAQ)

Brian R. Bondy 2008-10-25 14:47:11

+5 A:

The boundary-pushing compressors combine algorithms for insane results. Common algorithms include:

The Burrows-Wheeler Transform (BWT) - reversibly permutes characters (or other bit blocks) so that identical symbols tend to cluster into runs, which makes the source easier to compress. Decompression occurs as normal and the result is then un-shuffled with the inverse transform. Note: BWT alone doesn't actually compress anything. It just makes the source easier to compress.
Prediction by Partial Matching (PPM) - an evolution of arithmetic coding where the prediction model (context) is built adaptively by crunching statistics about the source, rather than relying on static probabilities. Even though its roots are in arithmetic coding, the result can be represented with Huffman encoding or a dictionary as well as arithmetic coding.
Context Mixing - Arithmetic coding with a fixed model uses a static context for prediction, PPM dynamically chooses a single context, and Context Mixing uses many contexts and weighs their results. PAQ uses context mixing.
Dynamic Markov Compression - related to PPM but uses bit-level contexts rather than byte-level or longer ones.
In addition, the Hutter Prize contestants may replace common text with short codes from external dictionaries, and may mark upper-case text with a special symbol instead of using two distinct entries for upper and lower case. That's why they're so good at compressing text (especially ASCII text) and not as valuable for general compression.
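
To make the Burrows-Wheeler step above concrete, here is a minimal sketch of the textbook version: sort all rotations of the input and keep the last column. It is deliberately naive and far too slow for real files (real implementations use suffix arrays). The function names and the `"\x00"` end-of-text sentinel are assumptions made for illustration, not taken from any particular compressor.

```python
def bwt(text: str, sentinel: str = "\x00") -> str:
    """Naive Burrows-Wheeler Transform: last column of the sorted rotations."""
    if sentinel in text:
        raise ValueError("input must not contain the sentinel character")
    s = text + sentinel  # the sentinel marks the original end of the text
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, sentinel: str = "\x00") -> str:
    """Rebuild the sorted rotation table one column at a time, then pick the original row."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        # Prepend the last column to every row and re-sort; after n rounds
        # the table holds every rotation of the original string.
        table = sorted(last_column[i] + table[i] for i in range(n))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]  # drop the sentinel


if __name__ == "__main__":
    transformed = bwt("banana")
    print(repr(transformed))         # the a's and n's end up next to each other
    print(inverse_bwt(transformed))  # banana
```

Note that the output has the same length as the input plus one sentinel. The gain only shows up when a later stage (move-to-front, run-length, Huffman or arithmetic coding) exploits the clustered symbols.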

Maximum Compression is a pretty cool text and general compression benchmark site. Matt Mahoney publishes another benchmark. Mahoney's may be of particular interest because it lists the primary algorithm used per entry.

Corbin March 2008-10-26 18:37:57
