I'm searching for an algorithm to compress small text strings: 50-1000 bytes (e.g. URLs). Which algorithm works best for this?
If you are talking about actually compressing the text, not just shortening it, then Deflate/gzip (gzip is a wrapper around Deflate) and zip work well for smaller files and text. Other algorithms, such as bzip2, are more efficient on larger files.
Wikipedia has a comparison of compression ratios and times on this page: http://en.wikipedia.org/wiki/Comparison_of_file_archivers (look for "Comparison of efficiency"):
Name         Text %   Text time (s)   Binaries %   Binaries time (s)   Raw images %   Raw images time (s)
7-zip          19        18.8             27            59.6               50              36.4
bzip2          20         4.7             37            32.8               51              20.0
rar (2.01)     23        30.0             36           275.4               58              52.7
advzip         24        21.1             37            70.6               57              41.6
gzip           25         4.2             39            23.1               60               5.4
zip            25         4.3             39            23.3               60               5.7
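For strings in the 50-1000 byte range the fixed overhead of these containers matters, so it's worth measuring on your own data. A quick sketch with Python's zlib (which implements Deflate), comparing the zlib-wrapped and raw Deflate output sizes for a single short string:

```python
import zlib

url = b"http://en.wikipedia.org/wiki/Comparison_of_file_archivers"

# The zlib wrapper adds a 2-byte header and a 4-byte checksum;
# a full gzip wrapper costs about 18 bytes on top of the Deflate stream.
zlib_out = zlib.compress(url, 9)

# Raw Deflate stream: negative wbits suppresses the zlib header/trailer.
co = zlib.compressobj(9, zlib.DEFLATED, -15)
raw_out = co.compress(url) + co.flush()

print(len(url), len(zlib_out), len(raw_out))
```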
Huffman has a static cost, the Huffman table, so I disagree it's a good choice.
There are adaptive versions which do away with this, but the compression rate may suffer. Actually, the question you should ask is "what algorithm compresses text strings with these characteristics". For instance, if long repetitions are expected, simple Run-Length Encoding might be enough. If you can guarantee that only English words, spaces, punctuation and the occasional digits will be present, then Huffman with a pre-defined Huffman table might yield good results.
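To illustrate the run-length case, here is a naive byte-oriented RLE sketch; the (count, byte) layout with the count capped at 255 is just a toy format of my own, and it only pays off when long repetitions really are common (it doubles the size of non-repetitive data):

```python
def rle_encode(data: bytes) -> bytes:
    """Naive run-length encoding: (count, byte) pairs, count capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out.append(run)
        out.append(data[i])
        i += run
    return bytes(out)

def rle_decode(data: bytes) -> bytes:
    out = bytearray()
    for count, byte in zip(data[::2], data[1::2]):
        out.extend([byte] * count)
    return bytes(out)
```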
Generally, algorithms of the Lempel-Ziv family have very good compression and performance, and libraries for them abound. I'd go with that.
With the information that what's being compressed are URLs, then I'd suggest that, before compressing (with whatever algorithm is easily available), you CODIFY them. URLs follow well-defined patterns, and some parts of them are highly predictable. By making use of this knowledge, you can codify the URLs into something smaller to begin with, and ideas behind Huffman encoding can help you here.
For example, translating the URL into a bit stream, you could replace "http" with the bit 1, and anything else with the bit "0" followed by the actual protocol (or use a table to get other common protocols, like https, ftp, file). The "://" can be dropped altogether, as long as you can mark the end of the protocol. Etc. Go read about the URL format, and think about how they can be codified to take less space.
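A minimal sketch of that codification idea, working on whole bytes rather than bits for simplicity; the scheme table and escape code below are assumptions of mine, not any standard:

```python
# Hypothetical URL codifier: replace the scheme with a one-byte code and drop "://".
SCHEMES = {"http": 0, "https": 1, "ftp": 2, "file": 3}
ESCAPE = 7  # code meaning "the URL is stored literally"

def codify(url: str) -> bytes:
    scheme, sep, rest = url.partition("://")
    if sep and scheme in SCHEMES:
        return bytes([SCHEMES[scheme]]) + rest.encode()
    return bytes([ESCAPE]) + url.encode()

def uncodify(data: bytes) -> str:
    code, rest = data[0], data[1:].decode()
    if code == ESCAPE:
        return rest
    inverse = {v: k for k, v in SCHEMES.items()}
    return inverse[code] + "://" + rest

# The codified form can then be fed to a general-purpose compressor.
```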
Any algorithm/library that supports a preset dictionary, e.g. zlib.
This way you can prime the compressor with the same kind of text that is likely to appear in the input. If the files are similar in some way (e.g. all URLs, all C programs, all StackOverflow posts, all ASCII-art drawings) then certain substrings will appear in most or all of the input files.
Every compression algorithm will save space if the same substring is repeated multiple times in one input file (e.g. "the" in English text or "int" in C code.)
But in the case of URLs certain strings (e.g. "http://www.", ".com", ".html", ".aspx") will typically appear once in each input file. So you need to share them between files somehow rather than having one compressed occurrence per file. Placing them in a preset dictionary will achieve this.
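With Python 3's zlib, for example, a preset dictionary can be passed via the zdict argument. The dictionary contents below are just my guess at common URL substrings; zlib prefers matches near the end of the dictionary, so put the most common strings last:

```python
import zlib

# Substrings likely to occur in URLs, most common ones at the end.
PRESET = b".aspx.html.org/https://.com/http://www."

def compress_url(url: bytes) -> bytes:
    c = zlib.compressobj(level=9, zdict=PRESET)
    return c.compress(url) + c.flush()

def decompress_url(blob: bytes) -> bytes:
    d = zlib.decompressobj(zdict=PRESET)
    return d.decompress(blob) + d.flush()

print(len(compress_url(b"http://www.example.com/index.html")))
```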
Check this: http://github.com/antirez/smaz/tree/master
"Smaz is a simple compression library suitable for compressing very short strings."
I don't have code to hand, but I always liked the approach of building a 2D lookup table of size 256 * 256 chars. To compress a string you loop over each char and use the lookup table to get the 'predicted' next char, using the current and previous char as indexes into the table. If there is a match you write a single 1 bit; otherwise you write a 0 bit followed by the char, and update the lookup table with the current char. This approach basically maintains a dynamic (and crude) lookup table of the most probable next character in the data stream.
You can start with a zeroed lookup table, but obviously it works best on very short strings if it is initialised with the most likely character for each character pair, e.g. for the English language. As long as the initial lookup table is the same for compression and decompression, you don't need to emit it with the compressed data.
This algorithm doesn't give a brilliant compression ratio, but it is incredibly frugal with memory and CPU resources and can also work on a continuous stream of data - the decompressor maintains its own copy of the lookup table as it decompresses, so the lookup table adjusts to the type of data being compressed.
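A minimal sketch of that scheme as I understand it; the details here are my own choices (the two preceding bytes index the 64 KB table, literals are written MSB-first after a 0 bit, and a 4-byte length prefix tells the decoder when to stop):

```python
def _new_table():
    # 256x256 prediction table: table[prev2][prev1] = predicted next byte.
    # Compressor and decompressor must each start from an identical copy.
    return [[0] * 256 for _ in range(256)]

def compress(data: bytes, table=None) -> bytes:
    """1 bit per correctly predicted byte, 9 bits (0 + literal) per miss."""
    table = table if table is not None else _new_table()
    bits = []
    p2 = p1 = 0
    for ch in data:
        if table[p2][p1] == ch:
            bits.append(1)                       # hit: single 1 bit
        else:
            bits.append(0)                       # miss: 0 bit + the literal byte
            bits.extend((ch >> i) & 1 for i in range(7, -1, -1))
            table[p2][p1] = ch                   # learn for next time
        p2, p1 = p1, ch
    out = bytearray(len(data).to_bytes(4, "big"))
    for i in range(0, len(bits), 8):             # pack bits MSB-first
        chunk = bits[i:i + 8]
        byte = 0
        for b in chunk:
            byte = (byte << 1) | b
        out.append(byte << (8 - len(chunk)))
    return bytes(out)

def decompress(blob: bytes, table=None) -> bytes:
    table = table if table is not None else _new_table()
    n = int.from_bytes(blob[:4], "big")
    bits = [(byte >> i) & 1 for byte in blob[4:] for i in range(7, -1, -1)]
    out = bytearray()
    pos = 0
    p2 = p1 = 0
    while len(out) < n:
        if bits[pos]:                            # 1 bit: take the prediction
            ch = table[p2][p1]
            pos += 1
        else:                                    # 0 bit: read the literal byte
            ch = 0
            for b in bits[pos + 1:pos + 9]:
                ch = (ch << 1) | b
            table[p2][p1] = ch                   # mirror the compressor's update
            pos += 9
        out.append(ch)
        p2, p1 = p1, ch
    return bytes(out)

packed = compress(b"http://www.example.com/")
assert decompress(packed) == b"http://www.example.com/"
```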