views: 243
answers: 2
I am trying to calculate an initial buffer size to use when decompressing data of unknown size. I have a large set of data points from existing compression streams but am not sure how best to analyze them.

Each data point consists of the compressed size and the ratio of uncompressed to compressed size. For example: 100,425 (compressed size) x 1.3413 (compression ratio) = 134,700 (uncompressed size)

The compressed data stream doesn't store the uncompressed size, so the decompressor has to allocate an initial buffer and realloc if it overflows. I'm looking for the "best" initial size to allocate given the compressed size. I have over 293,000 data points.

A: 

Given that you have a lot of data points describing how your compression behaves, I'd recommend analyzing that data to get a mean compression ratio and a standard deviation. Then, I'd recommend setting your initial buffer size to your original size * your compression size at 2 standard deviations above the mean; this will mean that your buffer is the right size for roughly 97.7% of your cases. If you want your buffer to avoid reallocation in more cases, increase the number of standard deviations above the mean that you're allocating for.

McWafflestix
Thanks for the quick response. Are you suggesting that I should calculate the mean and stdev on the compressed size, the ratio, or the uncompressed size? I'm not sure what you mean by "original size * your compression size". What is "your compression size" here?
Dwight Kelly
Also, should I remove duplicates prior to calculating the mean and stddev?
Dwight Kelly
I'd recommend reading up on statistics before implementing, but no, don't remove duplicates; they're important data points for calculating frequency. And you want to calculate your average compression ratio (the mean) and the standard deviation of that compression ratio. So let's say your average compression ratio is (say) 1.5, and your standard deviation is (say) 0.1; if you assume a compression ratio of 1.7 (1.5 + 2 * 0.1), you will cover roughly 97.7% of your cases (because you're covering 2 stddevs above the mean).
McWafflestix
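The calculation described above can be sketched in Python; the ratio values and the two-standard-deviation cutoff below are illustrative assumptions, not the asker's actual data:

```python
import statistics

# Hypothetical sample of observed compression ratios (uncompressed / compressed).
ratios = [1.42, 1.55, 1.48, 1.61, 1.50, 1.39, 1.57, 1.52]

mean = statistics.mean(ratios)
stddev = statistics.stdev(ratios)  # sample standard deviation

# Assume a ratio two standard deviations above the mean; for normally
# distributed ratios this covers roughly 97.7% of cases.
assumed_ratio = mean + 2 * stddev

def initial_buffer_size(compressed_size: int) -> int:
    """Initial decompression buffer size for a given compressed size."""
    return int(compressed_size * assumed_ratio) + 1

print(initial_buffer_size(100_425))
```

With real data you would compute `mean` and `stddev` once, offline, over all 293,000 samples and bake the resulting ratio into the decompressor.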
I had already tried a couple of ways to segment the sample data, but I still get strange results. For example, looking only at compressed sizes of 0..100,000 bytes, the ratio mean is 1.2214, the stddev is 116.4742, and the initial buffer ratio from your calculation would be 234.1698
Dwight Kelly
I suspect the way you are calculating your stddev is wrong somehow. In fact, with those numbers, I'm almost certain of it.
McWafflestix
using Excel's @stdev() function
Dwight Kelly
Excel may well be conflating two things: the mean may be reported as a ratio, but the stddev may be reported as an absolute deviation (i.e., not a ratio). You might want to use a small dataset to test Excel's behavior; ideally one whose results you know in advance, so you can use it as a check of how Excel does things.
McWafflestix
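One way to cross-check the spreadsheet result is to compute the stddev of the ratios directly and compare it with the stddev of the raw uncompressed sizes: if the "ratio" stddev comes out in the hundreds, the formula is almost certainly being applied to byte counts rather than ratios. A minimal sketch, with made-up size pairs:

```python
import statistics

# Hypothetical (compressed_size, uncompressed_size) pairs.
pairs = [(100, 150), (200, 290), (400, 610), (800, 1180)]

ratios = [u / c for c, u in pairs]   # dimensionless ratios near 1.5
sizes = [u for _, u in pairs]        # raw byte counts

ratio_sd = statistics.stdev(ratios)  # small: in the same units as the ratio
size_sd = statistics.stdev(sizes)    # large: in bytes, not a ratio

print(ratio_sd, size_sd)
```

A stddev of 116 next to a ratio mean of 1.22 looks much more like `size_sd` than `ratio_sd`, which matches the suspicion above.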
A: 

One simple method is to start with a common initial decompression buffer size and double it at each realloc. This doubling strategy is also used by many dynamic array implementations.

bill
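The doubling strategy can be sketched as follows (a Python sketch; a C decompressor would use realloc, and the 64 KB starting size is an assumption, not something the answer specifies):

```python
INITIAL_SIZE = 64 * 1024  # assumed common starting size

def grow_until_fits(required: int, initial: int = INITIAL_SIZE) -> tuple[int, int]:
    """Double the buffer size until it can hold `required` bytes.

    Returns (final_size, number_of_reallocations)."""
    size = initial
    reallocs = 0
    while size < required:
        size *= 2      # each realloc doubles the capacity
        reallocs += 1
    return size, reallocs

# Doubling converges quickly: even a 500 MB stream needs only
# 13 doublings starting from 64 KB.
print(grow_until_fits(500 * 1024 * 1024))
```

Because the size doubles each time, the number of reallocations grows only logarithmically with the output size, at the cost of up to 2x over-allocation in the worst case.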
I can't do that since some compressed streams are over 500MB in size
Dwight Kelly
I think you must consider the peak memory usage for all your streams and not the average memory needed. If you use only one decompression buffer, then you must have this memory available anyway. What compression library was used?
bill