I'm trying to decompress about 8000 files in gzip format in Java. My first try was to use GZIPInputStream but the performance was awful.

Does anyone know an alternative for decompressing gzip archives? I tried ZipInputStream, but it doesn't recognize the gzip format.

Thank you in advance.

A: 

For that kind of scale, you might want to go native, assuming your platform requirements are limited. You can use JNI to call a library or invoke a native command using ProcessBuilder.

sblundy
The Sun JRE implementation is native code via JNI already.
erickson
Interesting. That implies that the unzip step isn't the problem or can't be improved.
sblundy
+3  A: 

You need to use buffering. Writing small pieces of data is going to be inefficient. The compression implementation is in native code in the Sun JDK. Even if it weren't, buffered throughput should usually exceed what reasonable file or network I/O can deliver.

OutputStream out = new BufferedOutputStream(new GZIPOutputStream(rawOut));

InputStream in = new BufferedInputStream(new GZIPInputStream(rawIn));

As native code is used to implement the decompression/compression algorithm, be very careful to close the gzip stream itself (and not just the underlying stream) after use. I've found that having loads of Deflater instances hanging around is very bad for performance.
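The buffering and closing advice can be sketched roughly as follows. This is only a minimal illustration, assuming Java 7 or later for try-with-resources; the class name `GunzipExample`, the method `gunzip`, and the 64 KB buffer size are hypothetical choices, not something taken from the question.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

public class GunzipExample {
    // Hypothetical helper: decompresses one gzip file into the target path.
    public static void gunzip(Path source, Path target) throws IOException {
        // try-with-resources closes the GZIPInputStream itself, which releases
        // its native Inflater, rather than closing only the underlying file stream.
        try (InputStream in = new GZIPInputStream(
                     new BufferedInputStream(Files.newInputStream(source)), 64 * 1024);
             OutputStream out = new BufferedOutputStream(Files.newOutputStream(target))) {
            byte[] buffer = new byte[64 * 1024]; // assumed size; tune for your data
            int n;
            // Copy in large chunks instead of byte by byte.
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }
}
```

Calling `GunzipExample.gunzip(source, target)` once per file keeps every stream's lifetime confined to a single call, so no inflaters are left open between files.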

ZipInputStream deals with archives of files, which is a completely different thing from compressing a stream.

Tom Hawtin - tackline
The performance improved but not much :\
Rui Carneiro
As it uses native code, make very sure to close the gzip stream.
Tom Hawtin - tackline
+3  A: 

When you say that GZIPInputStream's performance was awful, could you be more specific? Did you find out whether it was a CPU bottleneck or an I/O bottleneck? Were you using buffering on both input and output? If you could post the code you were using, that would be very helpful.

If you're on a multi-core machine, you could try still using GZIPInputStream but with multiple threads, one per core, sharing a queue of files still to process. (Any one file would only be processed by a single thread.) That might make things worse if you're I/O bound, but it may be worth a try.
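A rough sketch of the one-thread-per-core idea, assuming Java 8 or later. The names `ParallelGunzip`, `gunzipAll`, and `targetFor` are hypothetical, and the rule of dropping a trailing `.gz` to name the output is an assumption for illustration. A fixed thread pool keeps its pending tasks in a single internal queue, which plays the role of the shared queue of files.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;

public class ParallelGunzip {
    // Hypothetical entry point: decompresses every source file into outputDir.
    public static void gunzipAll(List<Path> sources, Path outputDir)
            throws InterruptedException, ExecutionException {
        // One worker per core; each file is handled by exactly one thread.
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Void>> results = new ArrayList<>();
            for (Path source : sources) {
                Path target = targetFor(source, outputDir);
                Callable<Void> task = () -> {
                    gunzip(source, target);
                    return null;
                };
                results.add(pool.submit(task));
            }
            // Wait for every file; get() rethrows any failure from a worker.
            for (Future<Void> result : results) {
                result.get();
            }
        } finally {
            pool.shutdown();
        }
    }

    // Assumed naming rule: drop a trailing ".gz", otherwise append ".out".
    static Path targetFor(Path source, Path outputDir) {
        String name = source.getFileName().toString();
        String outName = name.endsWith(".gz")
                ? name.substring(0, name.length() - 3)
                : name + ".out";
        return outputDir.resolve(outName);
    }

    // Buffered decompression; closing the gzip stream releases its Inflater.
    static void gunzip(Path source, Path target) throws IOException {
        try (InputStream in = new GZIPInputStream(
                     new BufferedInputStream(Files.newInputStream(source)));
             OutputStream out = new BufferedOutputStream(Files.newOutputStream(target))) {
            byte[] buffer = new byte[64 * 1024];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }
}
```

Whether this helps depends on where the bottleneck is: if the disk is already saturated, extra threads may only add contention, so it is worth timing against the single-threaded version.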

Jon Skeet