I'm using a GZIPInputStream in my program, and I know the performance would improve if I could get Java to run my program in parallel.

In general, is there a command-line option for the standard VM to run on many cores? It's running on just one as it is.

Thanks!

Edit

I'm running plain ol' Java SE 6 update 17 on Windows XP.

Would putting the GZIPInputStream on a separate thread explicitly help? No! Do not put the GZIPInputStream on a separate thread! Do NOT multithread I/O!

Edit 2

I suppose I/O is the bottleneck, as I'm reading and writing to the same disk...

In general, though, is there a way to make GZIPInputStream faster? Or a replacement for GZIPInputStream that runs parallel?

Edit 3

Code snippet I used:

GZIPInputStream gzip = new GZIPInputStream(new FileInputStream(INPUT_FILENAME));
DataInputStream in = new DataInputStream(new BufferedInputStream(gzip));

+13  A: 

AFAIK the act of reading from this stream is single-threaded, so multiple CPUs won't help you if you're reading one file.

You could, however, have multiple threads, each unzipping a different file.
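A rough sketch of that, using an ExecutorService with one task per file (the file names and pool size are made up, and real code would want better error handling):

import java.io.*;
import java.util.concurrent.*;
import java.util.zip.GZIPInputStream;

public class ParallelGunzip {
    public static void main(String[] args) throws Exception {
        String[] files = { "a.gz", "b.gz", "c.gz" }; // hypothetical inputs
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        for (final String name : files) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        // Each task decompresses one file independently.
                        InputStream in = new BufferedInputStream(
                                new GZIPInputStream(new FileInputStream(name)));
                        OutputStream out = new BufferedOutputStream(
                                new FileOutputStream(name.replace(".gz", "")));
                        byte[] buf = new byte[8192];
                        int n;
                        while ((n = in.read(buf)) != -1) {
                            out.write(buf, 0, n);
                        }
                        out.close();
                        in.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}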

That being said, unzipping is not particularly computation-intensive these days; you're more likely to be blocked by the cost of I/O (e.g., if you are reading two very large files from two different areas of the HD).

More generally (assuming this is a question from someone new to Java), Java doesn't do things in parallel for you. You have to use threads to tell it what the units of work are and how to synchronize between them. Java (with the help of the OS) will generally take as many cores as are available to it, and will swap threads on the same core if there are more threads than cores (which is typically the case).

Uri
+1 for noting the IO bottleneck. This is too often overlooked in such cases.
BalusC
No no no no no no no, do NOT multithread I/O! Operating systems already synchronize I/O between multiple applications, and adding another layer of threads on top of the I/O abstraction, especially for reading, can grind the entire machine to a halt once you start using more than one thread for it.
Esko
My own experience is that it doesn't kill the computer; you just don't see any real benefit from the multithreading.
Uri
Multithreaded I/O usually leads to lots of HD seeks which, though they may not kill the computer, can definitely send it into a coma for a while.
Michael Borgwardt
+3  A: 

Wrap your GZIP streams in Buffered streams, this should give you a significant performance increase.

OutputStream out = new BufferedOutputStream(
    new GZIPOutputStream(
        new FileOutputStream(myFile)
    )
);

And likewise for the input stream. Using the buffered input/output streams reduces the number of disk reads.
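For the input side, that would look like this (a sketch, reusing the hypothetical myFile):

InputStream in = new BufferedInputStream(
    new GZIPInputStream(
        new FileInputStream(myFile)
    )
);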

Sam Barnum
Should I wrap my GZIP streams in Buffered streams, or my Buffered streams in GZIP streams? For example: `new GZIPInputStream(new BufferedInputStream(...))` and `new GZIPOutputStream(new BufferedOutputStream(...))` vs. `new BufferedInputStream(new GZIPInputStream(...))` and `new BufferedOutputStream(new GZIPOutputStream(...))`
Rudiger
I believe you should wrap your BufferedStream around your GZIP stream. This will make your I/O more independent of whatever blocking the unzipper is doing.
Carl Smotricz
I have always believed GZIPOutputStream is already buffered.
Chii
That it has a `flush()` method would hint that it's buffered, yes. Also, there's a constructor that lets you specify the buffer size. Enough evidence for me :)
Carl Smotricz
GZIPInputStream may be buffered already, but when I explicitly added a BufferedInputStream around it, it went several times faster.
Rudiger
@Rudiger, can you post a code snippet of how you're using the stream? Are you using an ObjectOutputStream?
Sam Barnum
Just a FileInputStream; code snippet is attached to the original question.
Rudiger
Maybe it's just the default buffer size of GZIPOutputStream (512) vs. the default buffer size of BufferedOutputStream (8192). I'd be curious whether you get good results from removing the Buffered stream and just upping the buffer size to 8192 (both GZIPInputStream and GZIPOutputStream have constructors that take a buffer size).
Sam Barnum
A: 

Compression seems like a hard case for parallelization because the bytes emitted by the compressor are a non-trivial function of the previous W bytes of input, where W is the window size. You can obviously break a file into pieces and create independent compression streams for each of the pieces that run in their own threads. You may need to retain some compression metadata so the decompressor knows how to put the file back together.
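One mitigating detail: the gzip format (RFC 1952) allows several gzip members to be concatenated into one file, so each independently compressed piece can simply be appended in order, at some cost in compression ratio since the dictionary resets at each member boundary. A sketch under those assumptions (the chunk size and file names are invented, the whole compressed output is held in memory for simplicity, and note that some versions of Java's GZIPInputStream stop after the first member even though command-line gzip handles concatenation fine):

import java.io.*;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.*;
import java.util.zip.GZIPOutputStream;

public class ChunkedGzip {
    // Compress one chunk into a standalone gzip member.
    static byte[] gzipChunk(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(bos);
        gz.write(data);
        gz.close();
        return bos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<Future<byte[]>> parts = new ArrayList<Future<byte[]>>();
        InputStream in = new BufferedInputStream(new FileInputStream("input.dat"));
        byte[] chunk = new byte[1 << 20]; // 1 MiB chunks, chosen arbitrarily
        int n;
        while ((n = in.read(chunk)) != -1) {
            final byte[] copy = Arrays.copyOf(chunk, n);
            parts.add(pool.submit(new Callable<byte[]>() {
                public byte[] call() throws IOException {
                    return gzipChunk(copy);
                }
            }));
        }
        in.close();
        // Concatenated gzip members form a valid gzip file (RFC 1952).
        OutputStream out = new BufferedOutputStream(
                new FileOutputStream("input.dat.gz"));
        for (Future<byte[]> part : parts) {
            out.write(part.get());
        }
        out.close();
        pool.shutdown();
    }
}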

GregS
Actually, compression is immensely helped by parallel processing.
Rudiger
Whether parallel processing helps compression or decompression depends on the compression function.
Chip Uni
+2  A: 

I'm not seeing any answer that addresses the other processing your program does.

If you're just unzipping a file, you'd be better off simply using the command-line gunzip tool; but most likely there's some processing happening with the files you're pulling out of that stream.

If you're extracting something that comes in reasonably sized chunks, then your processing of those chunks should be happening in a separate thread from the unzipping.

You could manually start a Thread on each large String or other block of data; but since Java 5 you'd be better off with one of the fancy classes in java.util.concurrent, such as a ThreadPoolExecutor.
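For instance, a minimal sketch with one thread unzipping and a pool doing the processing (the process method and the line-per-unit framing are placeholders):

import java.io.*;
import java.util.concurrent.*;
import java.util.zip.GZIPInputStream;

public class UnzipAndProcess {
    public static void main(String[] args) throws Exception {
        ExecutorService workers = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream("input.gz"))));
        String line;
        while ((line = reader.readLine()) != null) {
            final String chunk = line; // one line = one unit of work, for illustration
            workers.submit(new Runnable() {
                public void run() {
                    process(chunk); // hypothetical CPU-heavy processing
                }
            });
        }
        reader.close();
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.HOURS);
        // Real code would bound the work queue so a huge file can't flood memory.
    }

    static void process(String chunk) { /* placeholder */ }
}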


Update

It's not clear to me from the question and other comments whether you really ARE just extracting files using Java. If you really, really think you should try to compete with gunzip, then you can probably gain some performance by using large buffers; i.e. work with a buffer of, say, 10 MiB (binary, not decimal: 10,485,760 bytes), fill it in a single gulp and write it to disk likewise. That will give your OS a chance to do some medium-scale planning for disk space, and you'll need fewer system-level calls too.
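A sketch of that idea (file names are placeholders; the loop fills the big buffer completely before each write):

InputStream in = new GZIPInputStream(new FileInputStream("in.gz"), 65536);
OutputStream out = new FileOutputStream("out.dat");
byte[] buf = new byte[10 * 1024 * 1024]; // 10 MiB (binary) working buffer
int off = 0;
int n;
while ((n = in.read(buf, off, buf.length - off)) != -1) {
    off += n;
    if (off == buf.length) {  // buffer full: one large write
        out.write(buf, 0, off);
        off = 0;
    }
}
if (off > 0) {
    out.write(buf, 0, off);   // write the final partial buffer
}
out.close();
in.close();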

Carl Smotricz
I'm not just extracting files using Java, but I can see it was a little ambiguous in my question.
Rudiger
A: 

Compression and decompression using gzip is a serialized process. To use multiple threads, you would have to write a custom program to break the input file up into many streams, and then a custom program to decompress and join them back together. Either way, I/O is going to be a bottleneck WAY before CPU usage is.

fuzzy lollipop
Maybe someone can write a zipped input/output stream that follows the same API as InputStream and OutputStream, but for the multi-core era. However, I/O is the bottleneck.
Rudiger
Then it wouldn't be gzip anymore; it would be some custom format.
fuzzy lollipop
A: 

Run multiple VMs. Each VM is a process, and you should be able to run at least three processes per core without suffering any drop in performance. Of course, your application would have to be able to leverage multiprocessing in order to benefit. There is no magic bullet, which is why you see articles in the press moaning about how we don't yet know how to use multicore machines.

However, there are lots of people out there who have structured their applications into a master that manages a pool of worker processes and parcels out work packages to them. Not all problems are amenable to being solved this way.
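To illustrate, a minimal master that forks worker JVMs with ProcessBuilder (Worker and its work-package argument are hypothetical):

import java.util.ArrayList;
import java.util.List;

public class Master {
    public static void main(String[] args) throws Exception {
        int workerCount = Runtime.getRuntime().availableProcessors();
        List<Process> workers = new ArrayList<Process>();
        for (int i = 0; i < workerCount; i++) {
            // Each worker runs in its own JVM; "Worker" is a hypothetical
            // main class that takes a work-package id as its argument.
            ProcessBuilder pb = new ProcessBuilder(
                    "java", "-cp", System.getProperty("java.class.path"),
                    "Worker", String.valueOf(i));
            pb.redirectErrorStream(true);
            workers.add(pb.start());
            // Real code should drain each process's output stream, or the
            // child can block once the pipe buffer fills up.
        }
        for (Process p : workers) {
            p.waitFor();
        }
    }
}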

Michael Dillon
+2  A: 

PIGZ (Parallel Implementation of GZip) is a fully functional replacement for gzip that exploits multiple processors and multiple cores to the hilt when compressing data: http://www.zlib.net/pigz/ It's not Java yet; any takers? Of course the world needs it in Java.
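Until a Java port exists, one stopgap is shelling out to pigz where it's installed (this sketch assumes a Unix-like system with pigz on the PATH; bigfile.dat is a placeholder):

// Compress with one thread per core (-p); like gzip, pigz replaces
// bigfile.dat with bigfile.dat.gz on success.
Process p = new ProcessBuilder("pigz", "-p",
        String.valueOf(Runtime.getRuntime().availableProcessors()),
        "bigfile.dat").start();
int exit = p.waitFor(); // 0 indicates success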

Sometimes the compression or decompression is a big CPU consumer, even though it helps keep I/O from being the bottleneck.

See also Dataseries (C++) from HP Labs. PIGZ only parallelizes the compression, while Dataseries breaks the output into large compressed blocks, which are decompressible in parallel. It also has a number of other features.

George