What is the best way to find out if a java.io.InputStream contains zipped data?

+4  A: 

Not very elegant, but reliable:

If the Stream can be read via ZipInputStream, it should be zipped.
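
A minimal sketch of that idea, assuming it is acceptable to consume the stream while probing it. The class name `ZipReadProbe` and the helper `sampleZip` are made up for illustration. Note that `ZipInputStream.getNextEntry()` returns `null` rather than throwing when the data does not start with a local file header, so the check is on the return value; a valid but empty archive would also be reported as "not zipped" here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipReadProbe {

    /** Returns true if ZipInputStream can read a first entry. Consumes and closes the stream. */
    static boolean readableAsZip(InputStream in) {
        try (ZipInputStream zis = new ZipInputStream(in)) {
            // null means no local file header was found at the current position.
            return zis.getNextEntry() != null;
        } catch (IOException e) {
            // A header that cannot be parsed is treated as "not zipped".
            return false;
        }
    }

    /** Hypothetical helper: builds a one-entry ZIP archive in memory for the demo. */
    static byte[] sampleZip() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry("HelloWorld.txt"));
            zos.write("Hello world".getBytes(StandardCharsets.US_ASCII));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] text = "plain text".getBytes(StandardCharsets.US_ASCII);
        System.out.println(readableAsZip(new ByteArrayInputStream(sampleZip())));
        System.out.println(readableAsZip(new ByteArrayInputStream(text)));
    }
}
```

If the data is needed afterwards, the caller has to buffer it or reopen the source, since the probe reads from the stream.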

The MYYN
It just doesn't seem nice. Couldn't it be a corrupted ZIP stream?
Fedearne
@fedearne: Is a corrupted zip stream a zip stream?
GvS
I agree: If ZipInputStream can't read it, it doesn't *matter* that it's "meant" to be a Zip file. Right?
Carl Smotricz
This is the most reliable option. If it's corrupted, how do you know it was a ZIP? You just have to make a guess.
ZZ Coder
@GvS I have streams that are zipped and streams that are not zipped. I would rather not attempt to parse corrupted zip streams as not zipped, if this could be avoided.
Fedearne
If you check for 4 magic bytes, about 1 in 4,294,967,296 (2^32) completely random streams will be a false positive. Can you afford that? Are corrupted streams something that will occur more frequently than a non-zipped stream starting with the magic bytes?
GvS
+11  A: 

The magic bytes for the ZIP format are 50 4B. You could test the stream (using mark and reset - you may need to buffer) but I wouldn't expect this to be a 100% reliable approach. There would be no way to distinguish it from a US-ASCII encoded text file that began with the letters PK.
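
A rough sketch of the mark/reset test, assuming the caller passes a stream that supports marking (for example one wrapped in a `BufferedInputStream`). The class and method names are invented for illustration. It peeks at two bytes and restores the position, so the stream can still be handed to whatever reads it next.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class MagicBytePeek {

    /** Returns true if the next two bytes are 0x50 0x4B ("PK"); the stream position is restored. */
    static boolean startsWithPK(InputStream in) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("wrap the stream in a BufferedInputStream first");
        }
        try {
            in.mark(2);          // remember the current position
            int first = in.read();
            int second = in.read();
            in.reset();          // rewind so the caller sees the same bytes again
            return first == 0x50 && second == 0x4B;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Plain ASCII text that happens to begin with "PK" also passes the check.
        byte[] text = "PK is just text".getBytes(StandardCharsets.US_ASCII);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(text));
        System.out.println(startsWithPK(in));
    }
}
```

The demo input is ordinary text, which shows the false positive described above.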

The best way would be to provide metadata on the content format prior to opening the stream and then treat it appropriately.

McDowell
+4  A: 

You could check that the first four bytes of the stream are the local file header signature, which begins the local file header that precedes every file in a ZIP archive. The spec gives it as 50 4B 03 04.

A little test code shows this to work:

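import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

byte[] buffer = new byte[4];

try {
    // Write a one-entry ZIP file.
    try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream("so.zip"))) {
        zos.putNextEntry(new ZipEntry("HelloWorld.txt"));
        zos.write("Hello world".getBytes());
    }

    // Read back its first four bytes.
    try (FileInputStream is = new FileInputStream("so.zip")) {
        is.read(buffer);
    }
}
catch (IOException e) {
    e.printStackTrace();
}

// %H prints each byte's hash code in hex, without zero padding.
for (byte b : buffer) {
    System.out.printf("%H ", b);
}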

Gave me this output:

50 4B 3 4
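
To turn the four-byte check into something usable on a live stream, one possible approach (an assumption, not something the test above does) is a `PushbackInputStream`, which lets the bytes be read and then pushed back. The names `LocalHeaderCheck`, `wrap` and `startsWithLocalHeader` are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

class LocalHeaderCheck {

    // Local file header signature: 50 4B 03 04.
    private static final byte[] SIGNATURE = {0x50, 0x4B, 0x03, 0x04};

    /** Wraps a stream with a pushback buffer large enough for the signature. */
    static PushbackInputStream wrap(InputStream in) {
        return new PushbackInputStream(in, SIGNATURE.length);
    }

    /** Returns true if the stream starts with the signature; the bytes read are pushed back. */
    static boolean startsWithLocalHeader(PushbackInputStream in) {
        byte[] head = new byte[SIGNATURE.length];
        try {
            int n = 0;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break;
                }
                n += r;
            }
            in.unread(head, 0, n); // restore whatever was consumed
            return n == head.length && Arrays.equals(head, SIGNATURE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        PushbackInputStream in = wrap(new ByteArrayInputStream(data));
        System.out.println(startsWithLocalHeader(in));
    }
}
```

After the check, the same `PushbackInputStream` can be passed on to `ZipInputStream` or to the non-zip code path, since nothing has been consumed from it.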
Dave Webb
I had the same idea (though trusted Wikipedia over the spec - for shame!), but it seems that this is not a reliable mechanism: _"Implementers should be aware that ZIP files may be encountered with or without this signature marking data descriptors and should account for either case when reading ZIP files to ensure compatibility."_
McDowell
That's true from a general perspective, but my guess is that if you don't have the signature, ZipInputStream will fail, as it insists on ZipEntry objects.
Dave Webb
You can have random junk prepended to zip files (such as Microsoft Windows executables). Those only work if you use the central directory rather than streaming with local headers. FWIW, the Java PlugIn and WebStart use the central directory but now check the first four bytes as well (see GIARs).
Tom Hawtin - tackline
(Sorry, GIFARs.)
Tom Hawtin - tackline
Not sure if ZipInputStream will fail on that input. In an intelligent implementation, it will seek forward and *find* that signature. This is the way it's done in self-extracting archives, which on Windows have the PE-COFF signature at the beginning of the file and the PKZIP zip entry signature within the file, wherever the zip entries are. The file is both an EXE and a ZIP. Will Java's ZipInputStream read this stream? I don't know, but it *should*. The ZipInputStream class in other implementations (in DotNetZip, for example) can and will read this as a zip stream.
Cheeso