views: 243
answers: 2
I am trying to calculate an initial buffer size to use when decompressing data of unknown size. I have a large set of data points from existing compression streams but am not sure how best to analyze them.

Each data point consists of the compressed size and the ratio of uncompressed to compressed size. For example: 100,425 (compressed size) x 1.3413 (compression ratio) = 134,700 (uncompressed size)

The compressed data stream doesn't store the uncompressed size, so the decompressor has to allocate an initial buffer and realloc if it overflows. I'm looking for the "best" initial size to allocate given the compressed size. I have over 293,000 data points.

A: 

Given that you have a lot of data points describing how your compression behaves, I'd recommend analyzing that data to get a mean compression ratio and a standard deviation. Then, I'd recommend setting your initial buffer size to your original size * your compression size at 2 standard deviations above the mean; this will mean that your buffer is the right size for roughly 97.7% of your cases. If you want your buffer to avoid reallocation in more cases, increase the number of standard deviations above the mean that you're allocating for.

McWafflestix
Thanks for the quick response. Are you suggesting that I should calculate the mean and stdev on the compressed size, the ratio, or the uncompressed size? I'm not sure what you mean by "original size * your compression size". What is "your compression size" here?
Dwight Kelly
Also, should I remove duplicates prior to calculating the mean and stddev?
Dwight Kelly
I'd recommend reading up on statistics before implementing, but no, don't remove duplicates; they're important data points for calculating frequency. And you want to calculate your average compression ratio (the mean) and the standard deviation of that compression ratio. So let's say your average compression ratio is (say) 1.5, and your standard deviation is (say) 0.1; if you assume a compression ratio of 1.7 (1.5 + 2 * 0.1), you will cover roughly 97.7% of your cases (because you're covering 2 stddevs above the mean).
McWafflestix
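The calculation described above can be sketched in Python; the ratio values and the two-standard-deviation cutoff below are illustrative assumptions, not the asker's actual data:

```python
import statistics

# Hypothetical sample of observed compression ratios (uncompressed / compressed).
ratios = [1.42, 1.55, 1.48, 1.61, 1.50, 1.39, 1.57, 1.52]

mean = statistics.mean(ratios)
stddev = statistics.stdev(ratios)  # sample standard deviation

# Assume a ratio two standard deviations above the mean; for normally
# distributed ratios this covers roughly 97.7% of cases.
assumed_ratio = mean + 2 * stddev

def initial_buffer_size(compressed_size: int) -> int:
    """Initial decompression buffer size for a given compressed size."""
    return int(compressed_size * assumed_ratio) + 1

print(initial_buffer_size(100_425))
```

With real data you would compute `mean` and `stddev` once, offline, over all 293,000 samples and bake the resulting ratio into the decompressor.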
I had already tried a couple of ways to segment the sample data, but I still get strange results. For example, looking only at compressed sizes of 0..100,000 bytes, the ratio mean is 1.2214, the stddev is 116.4742, and the initial buffer ratio from your calculation would be 234.1698
Dwight Kelly
I suspect the way you are calculating your stddev is wrong somehow. In fact, with those numbers, I'm almost certain of it.
McWafflestix
using Excel's @stdev() function
Dwight Kelly
Excel may well be conflating two things: the mean may be reported as a ratio, but the stddev may be reported as an absolute deviation (i.e., not a ratio). You might want to use a small dataset to test Excel's behavior; ideally one whose results you know in advance, so you can use it as a check of how Excel does things.
McWafflestix
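One way to cross-check the spreadsheet result is to compute the stddev of the ratios directly and compare it with the stddev of the raw uncompressed sizes: if the "ratio" stddev comes out in the hundreds, the formula is almost certainly being applied to byte counts rather than ratios. A minimal sketch, with made-up size pairs:

```python
import statistics

# Hypothetical (compressed_size, uncompressed_size) pairs.
pairs = [(100, 150), (200, 290), (400, 610), (800, 1180)]

ratios = [u / c for c, u in pairs]   # dimensionless ratios near 1.5
sizes = [u for _, u in pairs]        # raw byte counts

ratio_sd = statistics.stdev(ratios)  # small: in the same units as the ratio
size_sd = statistics.stdev(sizes)    # large: in bytes, not a ratio

print(ratio_sd, size_sd)
```

A stddev of 116 next to a ratio mean of 1.22 looks much more like `size_sd` than `ratio_sd`, which matches the suspicion above.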
A: 

One simple method is to start with a common initial decompression buffer size and double it at each realloc. This doubling strategy is also used by many dynamic array implementations.

bill
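The doubling strategy can be sketched as follows (a Python sketch; a C decompressor would use realloc, and the 64 KB starting size is an assumption, not something the answer specifies):

```python
INITIAL_SIZE = 64 * 1024  # assumed common starting size

def grow_until_fits(required: int, initial: int = INITIAL_SIZE) -> tuple[int, int]:
    """Double the buffer size until it can hold `required` bytes.

    Returns (final_size, number_of_reallocations)."""
    size = initial
    reallocs = 0
    while size < required:
        size *= 2      # each realloc doubles the capacity
        reallocs += 1
    return size, reallocs

# Doubling converges quickly: even a 500 MB stream needs only
# 13 doublings starting from 64 KB.
print(grow_until_fits(500 * 1024 * 1024))
```

Because the size doubles each time, the number of reallocations grows only logarithmically with the output size, at the cost of up to 2x over-allocation in the worst case.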
I can't do that since some compressed streams are over 500MB in size
Dwight Kelly
I think you must consider the peak memory usage for all your streams and not the average memory needed. If you use only one decompression buffer, then you must have this memory available anyway. What compression library was used?
bill