Is there a library in .NET that does multithreaded compression of a stream? I'm thinking of something like the built-in System.IO.Compression.GZipStream, but using multiple threads to perform the work (and thereby utilizing all the CPU cores).

I know that 7-Zip, for example, compresses using multiple threads, but the C# SDK that they've released doesn't seem to do that.

+5  A: 

If the algorithm you're using isn't parallelized, I think your best bet is to split the data stream at equal intervals yourself and launch threads to compress each part separately in parallel. Afterwards, a single thread concatenates the results into a single stream (you can write a stream class that continues reading from the next stream when the current one ends).

You may wish to take a look at SharpZipLib which is somewhat better than the intrinsic compression streams in .NET.

EDIT: You will need a header to tell where each new stream begins, of course. :)
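A minimal sketch of that idea, written in Python for brevity rather than C#. The framing used here (a 4-byte chunk count followed by a 4-byte length before each compressed chunk) is an assumption made for illustration, not an established format, and the names `compress_chunked` and `decompress_chunked` are hypothetical:

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1 << 20  # 1 MiB per chunk; an arbitrary choice for this sketch


def compress_chunked(data: bytes, chunk_size: int = CHUNK_SIZE, workers: int = 4) -> bytes:
    """Compress fixed-size chunks in parallel and frame each with a length header."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # CPython's zlib releases the GIL while compressing, so threads can use several cores.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        compressed = list(pool.map(zlib.compress, chunks))  # map() preserves chunk order
    out = bytearray(struct.pack("<I", len(compressed)))  # header: number of chunks
    for part in compressed:
        out += struct.pack("<I", len(part))  # per-chunk header: compressed length
        out += part
    return bytes(out)


def decompress_chunked(blob: bytes) -> bytes:
    """Walk the length headers and decompress each chunk in order."""
    (count,) = struct.unpack_from("<I", blob, 0)
    offset = 4
    parts = []
    for _ in range(count):
        (length,) = struct.unpack_from("<I", blob, offset)
        offset += 4
        parts.append(zlib.decompress(blob[offset:offset + length]))
        offset += length
    return b"".join(parts)
```

Because the per-chunk lengths are stored, a reader could also hand the chunks to several threads for decompression. The trade-off is that each chunk starts with no history, so very small chunks compress noticeably worse.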

Cecil Has a Name
Yeah, I agree with this; I can't think of any specifically parallel compression libraries. If someone were to write one, I can't see how it would work other than by splitting the raw data into chunks and compressing each on its own thread. Be aware that if you split it into chunks that are too small, you will reduce the efficiency of the compression (both time and size).
Simon P Stevens
Good mention of SharpZipLib; I'm actually already using it. Regarding splitting the stream: yes, I'm aware of that solution. Unfortunately, the requirement is to compress a single stream that gets fed to my code and to write out a single compressed stream, so chunking the incoming data is not really an option.
Gareth
Seems like you are looking for very fine-grained threading, or "micro-parallelization" if you like. If you have the time, you might find a way to modify subroutines of #ZipLib to use parallelized loops, such as those in the Parallel Extensions for .NET (the Task Parallel Library).
Cecil Has a Name
A: 

A compression format (but not necessarily the algorithm) needs to be aware of the fact that you can use multiple threads. Or rather, not necessarily that you use multiple threads, but that you're compressing the original data in multiple steps, parallel or otherwise.

Let me explain.

Most compression algorithms compress data in a sequential manner: each piece of data is compressed using information learned from the data that has already been compressed. So for instance, if you're compressing a book by a bad author who uses a lot of the same words, clichés and sentences multiple times, by the time the compression algorithm comes to the second and later occurrences of those things, it will usually be able to compress them better than the first occurrence.
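As a rough illustration of that effect (a sketch in Python using zlib, whose DEFLATE algorithm looks back over a 32 KB window; the sample text is synthetic and the function name `context_gain` is hypothetical), compressing a passage followed by a second copy of itself costs far less than compressing the two copies independently:

```python
import random
import zlib


def context_gain(passage: bytes) -> tuple[int, int]:
    """Return (size of two copies compressed together, size of two copies compressed separately)."""
    together = len(zlib.compress(passage + passage))  # second copy can refer back to the first
    separately = 2 * len(zlib.compress(passage))      # each copy starts with no history
    return together, separately


rng = random.Random(0)  # fixed seed so the sample text is reproducible
vocabulary = ["".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(6)) for _ in range(500)]
# About 14 KB of text, small enough to fit inside zlib's 32 KB look-back window.
passage = " ".join(rng.choice(vocabulary) for _ in range(2000)).encode()

together, separately = context_gain(passage)
print(f"together: {together} bytes, separately: {separately} bytes")
```

The gap between the two numbers is the "knowledge" that is lost whenever the input is cut into pieces that are compressed independently.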

However, a side-effect of this is that you can't really splice together two compressed files without decompressing both and recompressing them as one stream. The knowledge from one file would not match the other file.

The solution of course is to tell the decompression routine that "Hey, I just switched to an altogether new data stream, please start fresh building up knowledge about the data".

If the compression format has support for such a code, you can easily compress multiple parts at the same time.

For instance, you could split a 1 GB file into four 256 MB parts, compress each part on a separate core, and then splice the results together at the end.
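One existing format that allows this kind of splicing is gzip: a gzip file may consist of several independently compressed members placed back to back, and a conforming reader restarts at each member. A minimal Python sketch under that assumption (the helper name `parallel_gzip` and the four-way split are illustrative choices, not part of any library):

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def parallel_gzip(data: bytes, parts: int = 4) -> bytes:
    """Compress `data` as `parts` gzip members in parallel and splice them together."""
    size = max(1, -(-len(data) // parts))  # ceiling division: bytes per part
    pieces = [data[i:i + size] for i in range(0, len(data), size)] or [b""]
    with ThreadPoolExecutor(max_workers=parts) as pool:
        members = list(pool.map(gzip.compress, pieces))  # map() preserves order
    return b"".join(members)  # concatenated members form one multi-member gzip stream


original = b"the same words, cliches and sentences " * 50000
spliced = parallel_gzip(original)
restored = gzip.decompress(spliced)  # Python's reader walks through every member
```

Not every decompressor reads past the first member, so it is worth checking that the reader you target (a particular GZipStream implementation, for example) handles multi-member input before relying on this.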

If you're building your own compression format, you can of course build support for this yourself.

Whether .ZIP or .RAR or any of the known compression formats can support this is unknown to me, but I know the .7Z format can.

Lasse V. Karlsen
+2  A: 

Found this library: http://www.codeplex.com/sevenzipsharp

Looks like it wraps the unmanaged 7z.dll which does support multithreading. Obviously not ideal having to wrap unmanaged code, but it looks like this is currently the only option that's out there.

Gareth
A: 

Normally I would say try Intel Parallel Studio, which lets you develop code specifically targeted at multi-core systems, but for now it supports C/C++ only. Maybe just create a library in C/C++ and call that from your C# code?

Colin
I don't see how this would help. If he is calling a compression library that isn't multithreaded, calling it from a C++ library written with Intel Parallel Studio isn't going to make it multithreaded. Is it? (Perhaps it is; I've never used it.)
Simon P Stevens