views: 105
answers: 4
It used to be that disk compression was used to increase storage space at the expense of CPU time, but we were all on single-processor systems back then.

These days there are spare cores around that could potentially do the decompression work in parallel with processing the data.

For I/O-bound applications (particularly read-heavy sequential data processing), it might be possible to increase throughput by reading and writing only compressed data to disk.
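
For concreteness, the pattern I have in mind is roughly the following (a minimal Python sketch using the standard gzip module; the file name and record format are just placeholders):

    import gzip

    # Write sequential records as a gzip stream instead of raw text.
    with gzip.open("records.txt.gz", "wt", encoding="utf-8") as out:
        for i in range(1_000_000):
            out.write(f"record-{i},field1,field2\n")

    # Read side: the disk only has to deliver the compressed bytes;
    # decompression happens on the CPU as the stream is consumed.
    total = 0
    with gzip.open("records.txt.gz", "rt", encoding="utf-8") as src:
        for line in src:
            total += len(line)
    print(total)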

Does anyone have any experience to support or reject this conjecture?

+4  A: 

Yes! In fact, processors are so ridiculously fast now that it even makes sense for main memory. (IBM does this, I believe.) I believe some of the current big-iron machines even do compression in the CPU cache.

Jörg W Mittag
+2  A: 

Yes, this makes perfect sense. On NT-based Windows OSes it's widely accepted that enabling NTFS compression can sometimes be faster than leaving it off, for precisely this reason. This has been true for years, and multicore should only make it more true.

dsimcha
+4  A: 

Take care not to confuse disk seek times and disk read rates. It takes millions of CPU cycles (milliseconds) to seek to the right track on the disk. Once you're there, you can read 10s of megabytes of data per second, assuming low fragmentation.

Whether or not the data is compressed on the disk, you still have to seek. The question becomes: is (disk read time for compressed data + decompression time) less than (disk read time for uncompressed data)? Decompression is relatively fast, since it amounts to replacing a short token with a longer one. In the end, it probably boils down to how well the data was compressed and how big it was in the first place. If you're reading a 2KB compressed file instead of a 5KB original, it's probably not worth it. If you're reading a 2MB compressed file instead of a 25MB original, it likely is.

Measure with a reasonable workload.
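
A quick way to run that measurement, if your data happens to be available both raw and gzip-compressed (the paths and chunk size below are just placeholders), is something like this Python sketch:

    import gzip
    import time

    def time_read(open_func, path):
        """Read a file in 1 MiB chunks and return the elapsed seconds."""
        start = time.perf_counter()
        with open_func(path, "rb") as f:
            while f.read(1024 * 1024):
                pass
        return time.perf_counter() - start

    raw = time_read(open, "data.raw")         # plain read
    gz = time_read(gzip.open, "data.raw.gz")  # read + decompress
    print(f"uncompressed: {raw:.2f}s, compressed+decompress: {gz:.2f}s")

Remember to clear (or bypass) the OS file cache between runs, otherwise the second read is served from memory and the comparison is meaningless.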

George V. Reilly
Very helpful; I needed to clarify the difference between seeking and reading in my thinking about this. So the expectation is that if there are lots of little data files, disk performance will be dominated by seeking and compression isn't going to help, but for reading big files it may?
Alex Stoddard
That would be my expectation, but it takes measurement to know for sure.
George V. Reilly
+1  A: 

I think it also depends on how aggressive your compression is versus how I/O-bound you are.

For example, DB2's row compression feature is targeted at I/O-bound applications: data warehouses, reporting systems, etc. It uses a dictionary-based algorithm and isn't very aggressive, resulting in 50-80% compression of the data (tables and indexes, both on disk and in memory). However, it also tends to speed queries up by around 10%.

They could have gone with much more aggressive compression, but that would have come with a performance hit.
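
To get a feel for that trade-off outside of DB2 (this is generic zlib, not DB2's dictionary scheme, and the input file is just a placeholder), you can compare compression levels on a sample of your own data:

    import time
    import zlib

    data = open("sample_table_dump.bin", "rb").read()  # a representative chunk of your data

    for level in (1, 6, 9):  # light, default, aggressive
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        ratio = 100 * (1 - len(compressed) / len(data))
        print(f"level {level}: {ratio:.0f}% smaller, compressed in {elapsed:.3f}s")

The higher levels typically buy a little extra space at a noticeably higher CPU cost, which is exactly the trade-off I'm describing.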

KenFar