views: 821
answers: 9
Hi!

I'm building an index which is just several sets of ordered 32-bit integers stored contiguously in a binary file. The problem is that this file grows pretty large. I've been thinking of adding some compression scheme, but that's a bit outside my expertise. So I'm wondering: what compression algorithm would work best in this case? Decompression also has to be fast, since this index will be used for lookups.

+1  A: 

I would imagine Huffman coding would be quite appropriate for this purpose (and relatively quick compared to other algorithms with similar compression ratios).

EDIT: My answer was only a general pointer. Niyaz's suggestion of encoding the differences between consecutive numbers is a good one. (However, if the list is not ordered or the spacing of the numbers is very irregular, I think plain Huffman encoding would be no less effective. In fact, LZW or similar would likely be best in this case, though possibly still not very good.)

Noldorin
I thought Huffman encoding would work only if there are some repeating elements. Here we MAY not have many repeating elements.
Niyaz
Yeah, I think Niyaz is right: Huffman efficiency increases with the number of repeated symbols. If you had repeated symbols in the raw input list, you may as well just use RLE (seeing as they're sorted).
Alabaster Codify
Yes, the fact that they are ordered would suggest it is better to encode the differences.
Noldorin
+7  A: 

Are the integers grouped in a dense way or a sparse way?

By dense I'm referring to:

[1, 2, 3, 4, 42, 43, 78, 79, 80, 81]

By sparse I'm referring to:

[1, 4, 7, 9, 19, 42, 53, 55, 78, 80]

If the integers are grouped in a dense way you could compress the first vector to hold three ranges:

[(1, 4), (42, 43), (78, 81)]

That is a 40% reduction in size. Of course this algorithm does not work well on sparse data, as the compressed form would take up 100% more space than the original data.
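
A minimal sketch of that range encoding, assuming a sorted array with no duplicates (the class and method names are illustrative, not from the answer):

```java
import java.util.ArrayList;
import java.util.List;

public class RangeEncoder {
    // Collapse a sorted int array into [start, end] ranges of consecutive values.
    static List<int[]> toRanges(int[] sorted) {
        List<int[]> ranges = new ArrayList<>();
        int i = 0;
        while (i < sorted.length) {
            int start = sorted[i];
            int end = start;
            while (i + 1 < sorted.length && sorted[i + 1] == end + 1) {
                end = sorted[++i];
            }
            ranges.add(new int[] { start, end });
            i++;
        }
        return ranges;
    }

    public static void main(String[] args) {
        int[] dense = { 1, 2, 3, 4, 42, 43, 78, 79, 80, 81 };
        for (int[] r : toRanges(dense)) {
            System.out.println("(" + r[0] + ", " + r[1] + ")");
        }
        // Prints (1, 4), (42, 43), (78, 81): 6 stored ints instead of 10.
    }
}
```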

dalle
It wouldn't take 100% more space if you allowed plain numbers and, where possible, ranges, instead of always using ranges.
Juan
Juan, I don't know, but I don't think it is possible to do that. You will have a lot of overhead storing data like that.
Niyaz
A: 

I'd use something bog standard off the shelf before investing in your own scheme.

In Java for example you can use GZIPOutputStream to apply gzip compression.
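
For what it's worth, a minimal sketch of that off-the-shelf route, streaming the raw 4-byte integers through GZIPOutputStream (the file layout and names are assumptions, not from the answer):

```java
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class GzipIndexWriter {
    // Write a sorted int array to a gzip-compressed file, 4 bytes per value.
    static void write(String path, int[] values) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new GZIPOutputStream(new FileOutputStream(path)))) {
            for (int v : values) {
                out.writeInt(v);
            }
        }
    }
}
```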

pauldoo
+14  A: 

If you are storing integers which are close together (e.g. 1, 3, 4, 5, 9, 10, etc.) rather than some random 32-bit integers (982346..., 3487623412..., etc.), you can do one thing:

Find the differences between adjacent numbers, which in our example would be 2, 1, 1, 4, 1, etc., and then Huffman-encode these numbers.

I don't think Huffman encoding will work if you apply it directly to the original list of numbers you have.

But if you have a sorted list of nearby numbers, the odds are good that you will get a very good compression ratio by Huffman-encoding the differences, maybe a better ratio than with the LZW algorithm used in the zip libraries.
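
A rough sketch of the delta step in Java; instead of a hand-rolled Huffman coder it feeds the delta bytes to java.util.zip.Deflater with the HUFFMAN_ONLY strategy, which only approximates the per-symbol Huffman coding this answer describes:

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.zip.Deflater;

public class DeltaHuffman {
    // Replace each value with its difference from the previous one.
    // The first "delta" is the first value itself (previous value taken as 0).
    static int[] toDeltas(int[] sorted) {
        int[] deltas = new int[sorted.length];
        int prev = 0;
        for (int i = 0; i < sorted.length; i++) {
            deltas[i] = sorted[i] - prev;
            prev = sorted[i];
        }
        return deltas;
    }

    // Compress the delta bytes using Huffman coding only (no LZ77 matching).
    static byte[] compress(int[] deltas) {
        ByteBuffer buf = ByteBuffer.allocate(deltas.length * 4);
        for (int d : deltas) {
            buf.putInt(d);
        }
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setStrategy(Deflater.HUFFMAN_ONLY);
        deflater.setInput(buf.array());
        deflater.finish();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) {
            out.write(chunk, 0, deflater.deflate(chunk));
        }
        deflater.end();
        return out.toByteArray();
    }
}
```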

Anyway thanks for posting this interesting question.

Niyaz
Another good thing about this approach is that the numbers do not need to be close to each other, as long as there is some kind of structure. If there is structure in the values, this algorithm will catch it, and the Huffman encoding will do the rest.
dalle
Hmm.. Interesting! Thanks for pointing that out, dalle.
Niyaz
It would indeed work just as well on 1000, 2000, 3000, 5000, 7000, 8000, for instance. But there are other distributions which make this sub-optimal.
MSalters
What happens if we look at the deltas more than once? Does this increase or decrease the probability of good compression? For series like `n=n+a;a+=1` this would certainly help.
Georg
+1  A: 

The conditions on the lists of integers are slightly different, but the question "Compression for a unique stream of data" suggests several approaches which could help you.

I'd suggest prefiltering the data into a start value and a series of offsets. If you know the offsets will be reliably small, you could even encode them as 1- or 2-byte quantities instead of 4 bytes. If you don't know this, each offset can still be 4 bytes, but since they will be small diffs, you'll get many more repeats than you would by storing the original integers.

After prefiltering, run your output through the compression scheme of your choice - something that works on a byte level, like gzip or zlib, would probably do a really nice job.
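
A sketch of that prefilter-then-compress pipeline, keeping each offset at 4 bytes as described and letting gzip exploit the repeated small diffs (the names and the length-prefixed layout are my own, not the answer's):

```java
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class PrefilteredWriter {
    // Store the count, the start value, and then each value's offset from its
    // predecessor; the byte-level compressor handles the repetitive offsets.
    static void write(String path, int[] sorted) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new GZIPOutputStream(new FileOutputStream(path)))) {
            out.writeInt(sorted.length);
            int prev = 0;
            for (int v : sorted) {
                out.writeInt(v - prev); // first entry is the start value itself
                prev = v;
            }
        }
    }
}
```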

Blair Conrad
A: 

Maybe you could store the differences between consecutive 32-bit integers as 16-bit integers.
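
A sketch of this, with one added hedge: a gap too large for 16 bits is written as an escape marker followed by the full 32-bit difference (the escape convention is my addition, not part of the answer):

```java
import java.io.DataOutputStream;
import java.io.IOException;

public class ShortDeltaWriter {
    private static final int ESCAPE = 0xFFFF; // reserved: "read a full int next"

    // Write sorted values as the first int followed by 16-bit deltas,
    // escaping any delta that does not fit in 16 bits.
    static void writeDeltas(DataOutputStream out, int[] sorted) throws IOException {
        out.writeInt(sorted[0]);
        for (int i = 1; i < sorted.length; i++) {
            int delta = sorted[i] - sorted[i - 1];
            if (delta >= 0 && delta < ESCAPE) {
                out.writeShort(delta);
            } else {
                out.writeShort(ESCAPE);
                out.writeInt(delta);
            }
        }
    }
}
```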

John D. Cook
+3  A: 

As you've discovered, a sorted sequence of N 32-bit integers doesn't contain 32*N bits of information. This is no surprise: assuming no duplicates, for every sorted sequence there are N! unsorted sequences containing the same integers.

Now, how do you take advantage of the limited information in the sorted sequence? Many compression algorithms base their compression on the use of shorter bitstrings for common input values (Huffman uses only this trick). Several posters have already suggested calculating the differences between numbers, and compressing those differences. They assume it will be a series of small numbers, many of which will be identical. In that case, the difference sequence will be compressed well by most algorithms.

However, take the Fibonacci sequence. That's definitely sorted integers. The difference between F(n) and F(n+1) is F(n-1). Hence, compressing the sequence of differences is equivalent to compressing the sequence itself - it doesn't help at all!

So, what we really need is a statistical model of your input data. Given the sequence N[0]...N[x], what is the probability distribution of N[x+1]? We know that P(N[x+1] < N[x]) = 0, as the sequence is sorted. The differential/Huffman-based solutions presented work because they assume P(N[x+1] - N[x] = d) is quite high for small positive d and independent of x, so they can use a few bits for the small differences. If you can give another model, you can optimize for that.
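
As one concrete illustration of "a few bits for the small differences" (not something proposed in this answer), a variable-length byte encoding of the deltas spends a single byte on gaps below 128 and more only when needed:

```java
import java.io.ByteArrayOutputStream;

public class VarintDeltas {
    // Encode one non-negative value in 7-bit groups, low group first;
    // the high bit of each byte says whether another byte follows.
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while (value >= 0x80) {
            out.write((int) (value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write((int) value);
    }

    // Deltas of a sorted sequence are non-negative, so small gaps cost one byte each.
    static byte[] encode(int[] sorted) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long prev = 0;
        for (int v : sorted) {
            long value = v & 0xFFFFFFFFL; // treat as unsigned 32-bit
            writeVarint(out, value - prev);
            prev = value;
        }
        return out.toByteArray();
    }
}
```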

MSalters
+2  A: 

If you need fast random-access lookup, then a Huffman-encoding of the differences (as suggested by Niyaz) is only half the story. You will probably also need some sort of paging/indexing scheme so that it is easy to extract the nth number.

If you don't do this, then extracting the nth number is an O(n) operation: on average you have to read and Huffman-decode half the file before you can find the number you were after. You have to choose the page size carefully to balance the overhead of storing page offsets against the speed of lookup.
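
A sketch of such a paging scheme, assuming fixed-size pages of 128 numbers; PageDecoder is a hypothetical stand-in for whatever Huffman decoder is actually used:

```java
public class PagedLookup {
    static final int PAGE_SIZE = 128; // numbers per page; an assumed tuning knob

    // Decodes the values of a single page starting at a given byte offset.
    // An implementation would wrap the real Huffman/delta decoder.
    interface PageDecoder {
        int[] decodePage(long byteOffset, int startValue);
    }

    static final class PageEntry {
        final long byteOffset; // where the page's compressed bytes begin
        final int startValue;  // running value to resume delta decoding from

        PageEntry(long byteOffset, int startValue) {
            this.byteOffset = byteOffset;
            this.startValue = startValue;
        }
    }

    // Extract the nth number by decoding only its page, not the whole file.
    static int nth(PageEntry[] directory, PageDecoder decoder, int n) {
        PageEntry page = directory[n / PAGE_SIZE];
        int[] values = decoder.decodePage(page.byteOffset, page.startValue);
        return values[n % PAGE_SIZE];
    }
}
```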

Simon Nickerson
+1  A: 

MSalters' answer is interesting, but it might distract you if you don't analyze your data properly. There are only 47 Fibonacci numbers that fit in 32 bits.

But he is spot on about how to properly solve the problem: analyze the series of increments to find patterns to compress.

Things that matter: a) Are there repeated values? If so, how often? (If it matters, make it part of the compression; if not, make it an exception.) b) Does it look quasi-random? This can also be good, as a suitable average increment can likely be found.

alecco
After a certain threshold of numbers has been reached there _will_ be repeated patterns; it would be interesting to know when. Therefore I think the algorithm Niyaz suggested works well.
Georg
Yes indeed. But as simonn noted, a naïve implementation can push random access towards O(n). Also, I think there could be distributions that wouldn't help; adaptive Huffman coding could deal with those. And finally, quite often a trivial RLE (on the increments) would do the job in very few lines of code.
alecco