I am looking at a new file format specification, and the specification says the file can be either XML-based or a zip file containing an XML file and other files.

The file extension is the same in both cases. What ways could I test the file to decide if it needs decompressing or just reading?

+8  A: 

You could look at the magic number of the file. The ones for ZIP archives are listed on the ZIP format wikipedia page: PK\003\004 or PK\005\006.
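
A minimal sketch of that check in Python, assuming only the two signatures quoted above matter. The names `starts_with_zip_signature` and `sniff_file` are made up for illustration, and the sample archive is built in memory just so the snippet is self-contained:

```python
import io
import zipfile

# Signatures quoted in the answer above (octal \003\004 is hex 03 04).
ZIP_SIGNATURES = (b"PK\x03\x04", b"PK\x05\x06")


def starts_with_zip_signature(head: bytes) -> bool:
    """Return True if the leading bytes match one of the ZIP signatures."""
    return head[:4] in ZIP_SIGNATURES


def sniff_file(path: str) -> bool:
    """Hypothetical helper: read only the first four bytes of a file."""
    with open(path, "rb") as fh:
        return starts_with_zip_signature(fh.read(4))


# Build a tiny archive in memory so the example needs no files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("content.xml", "<root/>")

zip_bytes = buffer.getvalue()
xml_bytes = b'<?xml version="1.0"?><root/>'

print(starts_with_zip_signature(zip_bytes))
print(starts_with_zip_signature(xml_bytes))
```

This only inspects the first four bytes, so an archive with other data in front of it (for example a self-extracting executable) would not be recognised by this test.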

Amber
Yep, but just so the OP knows... a 'valid magic number' does not guarantee that the file is not corrupt or of a wrong type.
KMan
Indeed. However, if their problem is just differentiating between two valid formats, then the magic number is the way to go.
Amber
There is no magic number for a zip file. Often, zip files begin with these sequences, but not every zip file does.
Cheeso
+1  A: 

Check the first few bytes of the file for the magic number. Zip files begin with PK (50 4B). As XML files cannot start with these characters and still be valid, you can be fairly sure as to the file type.

Yacoby
There is no magic number for zip files. If Wikipedia says or suggests that there is, it's wrong.
Cheeso
@Cheeso Yes there is. Please read the format http://www.pkware.com/documents/casestudies/APPNOTE.TXT and note the "local file header signature" and its defined value.
Yacoby
I understand why you would think that, from reading the text, but it is not correct. The text is fuzzy, but in practice, there is no magic number. http://en.wikipedia.org/wiki/ZIP_(file_format) as well as practical experience demonstrate that you are interpreting the spec incorrectly in assuming a magic number. Examine a self-extracting archive generated by WinZip or Infozip. It is both a PE-COFF file and a zip file. It uses the MZ magic number, but can be read as a zipfile by compliant ZIP tools.
Cheeso
A: 

Just check whether the first bytes of the file are ASCII characters or not. If they are, then you have XML, as it is a normal text file. If not, you have zipped data.

For more complicated situations you may need to check the Magic Number.

Andrejs Cainikovs
* ZIP files always begin with 4 bytes in the ASCII range.
* It's possible for ZIP files to be composed entirely of bytes in the ASCII range.
* What happens if the XML file uses an encoding that uses bytes outside the ASCII range? Like any UTF-8/16/32 file with a BOM or with non-Latin characters?
Joe Gauterin
NO, zip files do not always begin with 4 bytes in the ASCII range. Zip files also DO NOT always begin with PK, or 50 4b. The misunderstanding is very common, but still wrong.
Cheeso
A: 

You could try unzipping it, since an XML file is exceedingly unlikely to be a valid zip file, or you could check the magic numbers, as others have said.

Dominic Rodger
+1  A: 

File magic numbers

PoweRoy
There is no magic number for a zip file.
Cheeso
Yes there is a magic number: zip files start with PK (50 4B 03 04)
PoweRoy
+1  A: 

You can use the file command to see if it's a text file (xml) or an executable (zip). Scroll down to see an example.

ccheneson
Oops, I thought there would be a system call file() as well.
ccheneson
A: 

It depends on what you are using, but the zip library might have a function that tests whether or not a file is a zip file, something like is_zip, test_file_zip or whatever.

Or create your own function using the magic number given above.
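
As one example of such a library function, Python's standard `zipfile` module has `zipfile.is_zipfile`. The sketch below is only an illustration: the `is_zip` wrapper is a made-up name, and the `MZ` plus padding prefix is just an assumed stand-in for the stub of a self-extracting archive, not a real executable:

```python
import io
import zipfile

# Build a tiny archive in memory so the example needs no files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("content.xml", "<root/>")
plain_zip = buffer.getvalue()

# Assumption: arbitrary leading bytes stand in for a self-extractor stub.
prefixed_zip = b"MZ" + b"\x00" * 64 + plain_zip
xml_doc = b'<?xml version="1.0" encoding="UTF-8"?><root/>'


def is_zip(data: bytes) -> bool:
    """Hypothetical wrapper: ask the zipfile module about in-memory bytes."""
    return zipfile.is_zipfile(io.BytesIO(data))


for label, data in (("plain", plain_zip), ("prefixed", prefixed_zip), ("xml", xml_doc)):
    print(label, is_zip(data))
```

In CPython this check locates the archive through the record at the end of the file rather than through the first bytes, which is why it can still recognise an archive that has other data in front of it. `is_zipfile` also accepts a file name instead of a file-like object.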

solsTiCe
There is no magic number for a zip file.
Cheeso
+8  A: 

The zip file format is defined by PKWARE. You can find their file specification here.

Near the top you will find the header specification:

A. Local file header:

    local file header signature     4 bytes  (0x04034b50)
    version needed to extract       2 bytes
    general purpose bit flag        2 bytes
    compression method              2 bytes
    last mod file time              2 bytes
    last mod file date              2 bytes
    crc-32                          4 bytes
    compressed size                 4 bytes
    uncompressed size               4 bytes
    file name length                2 bytes
    extra field length              2 bytes

    file name (variable size)
    extra field (variable size)

From this you can see that the first 4 bytes of the header should be the file signature which should be the hex value 0x04034b50. Byte order in the file is the other way round - PKWARE specify that "All values are stored in little-endian byte order unless otherwise specified.", so if you use a hex editor to view the file you will see 50 4b 03 04 as the first 4 bytes.

You can use this to check if your file is a zip file. If you open the file in notepad, you will notice that the first two bytes (50 and 4b) are the ASCII characters PK.
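
To illustrate the layout above, here is a rough Python sketch that decodes the fixed-size part of a local file header with `struct`. The helper name `parse_local_header` and the returned dictionary keys are made up for this example, and it assumes the data really does begin with a local file header:

```python
import io
import struct
import zipfile

# Fixed-size fields of the local file header listed above, little-endian.
# I = 4 bytes, H = 2 bytes; 30 bytes in total.
LOCAL_HEADER = struct.Struct("<IHHHHHIIIHH")
LOCAL_HEADER_SIGNATURE = 0x04034B50


def parse_local_header(data: bytes) -> dict:
    """Hypothetical helper: decode a local file header at the start of data."""
    (signature, version_needed, flags, method, mod_time, mod_date,
     crc32, compressed_size, uncompressed_size,
     name_length, extra_length) = LOCAL_HEADER.unpack_from(data, 0)
    if signature != LOCAL_HEADER_SIGNATURE:
        raise ValueError("data does not start with a local file header")
    name_start = LOCAL_HEADER.size
    return {
        "signature": signature,
        "compression_method": method,
        "file_name": data[name_start:name_start + name_length],
        "extra_field_length": extra_length,
    }


# Build a small archive in memory so the example needs no files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("content.xml", "<root/>")
zip_bytes = buffer.getvalue()

header = parse_local_header(zip_bytes)
print(zip_bytes[:4].hex(" "))  # first four bytes as a hex editor would show them
print(header["file_name"])
```

The `<` in the format string is what applies the little-endian byte order, so the value 0x04034b50 is matched against the bytes 50 4b 03 04 on disk.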

Simon P Stevens
+1 Great info. But ideally, it would vary from vendor to vendor, which means the compression algorithm.
KMan
http://en.wikipedia.org/wiki/ZIP_(file_format)
KMan
The ZIP file format does not vary from vendor to vendor. It was defined originally by PKWARE, but many other vendors now support the same compression format. The format specifies the PK in the header, so even other vendors will still include this part of the header. Different file formats like arc, 7z, lhz, gzip etc. will have different specifications and different headers, but a zip file will always have this in the header.
Simon P Stevens
"the byte order in a file is the other way round" if your system is little-endian.
Steve Jessop
@Steve: Yeah, I clarified that. PKWARE specify little-endian in the format.
Simon P Stevens
The first 2 bytes in a zip are often, but not necessarily, PK. The header given here is a header for a zipentry, which may appear anywhere in the zip file. The zipfile need not start with a zip entry, and need not start with PK. There is no "magic number" for a zip file. The first zip entry need not be "near" the top of the file. There's nothing in the spec that requires that. While it is not required that a zip file begin with a zip entry, it is typical.
Cheeso
@Cheeso. That's interesting info. Thanks. Are you sure though? The document I have referenced specifically states that the "Overall .ZIP file format" begins with a "[local file header 1]", which starts with the bytes mentioned. Do you have a reference for what you are saying?
Simon P Stevens
I'm sure. The reference is the zip spec: PKWare's Appnote.txt. http://www.pkware.com/documents/casestudies/APPNOTE.TXT It never says "All zip files begin with 'xxxx'". On Windows, a self-extracting zip is both a zip and a PE-COFF file. The PE-COFF requires a magic number, the zip does not. A GIF or JPG, both of which have magic numbers, can also hold zip content.
Cheeso
@Cheeso It clearly states that all zip files should begin with the "local file header signature" which must have a value of 0x04034b50. A zip file must have a *local file header*, the first 4 bytes of which are the *local file header signature* which should have a value of 0x04034b50. I cannot see how it can be read any other way.
Yacoby
@Yacoby - Nowhere does it say that a zip file must begin with that header. It states that each entry in the zip must start with that header. Valid zip files do not need to begin with that header. There is no zip magic number.
Cheeso
@Cheeso. Sorry, I think the spec does say that the zip file format begins with 0x04034b50. Yes, it says that each entry begins with that number, but it also says that the first thing in the file is the first zip entry, so by necessity the number is first. We aren't talking about self extracting zips, or zip *data* embedded in jpeg files. This question is specifically about zip files - that is files containing only the zip format and nothing else. I understand there are lots of possible uses for zip data, and it can be embedded inside other formats, but standalone zip files do start with PK.
Simon P Stevens
You and I interpret it differently, Simon. My interpretation agrees with that of the WinZip, InfoZip, and PKZip tools. Self-extracting ZIP files *are* ZIP files. They conform to the zip format. Rename them from .exe to .zip and they work like a ZIP file, in any tool. PK is the first pair of bytes in a large majority of zip files, but not all.
Cheeso
@Cheeso, Yes I think we do clearly have different interpretations. I would argue that self-extracting zip files do not conform to the zip format, they are an extension to it. Sure, zip tools can treat them in the same way, but that doesn't make them zip files. Photoshop can open jpeg and jpeg2000 files, and to an end user they would appear to be handled the same, but that doesn't mean the formats are the same. Anyway though, I don't want to get into a huge debate. I do see your point that, strictly speaking, not all files containing zip data start with PK. I've learnt something new here, so thanks.
Simon P Stevens
+1  A: 

Not a good solution though, but just thinking out loud... how about:

try
{
    // Throws an exception if the file is not an XML file.
    LoadXmlFile(theFile);
}
catch (Exception ex)
{
    // Fall back to treating the file as a zip archive.
    LoadZipFile(theFile);
}
KMan
I voted this up, however personally I do not like using try catch to control the program. I am looking for a more exact test. Thanks for your input though.
Phil Hannent
+1  A: 

You could check the file to see if it contains a valid XML header. If it doesn't, try decompressing it.

See the XML specification.
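
A rough Python sketch of that XML-first test, under some assumptions: the XML declaration is optional, so the check only looks for a leading `<` after an optional byte-order mark and whitespace, and only UTF-8 and UTF-16 marks are handled. `looks_like_xml` is a made-up name:

```python
import codecs

# Byte-order marks the sketch knows about (assumption: UTF-8 and UTF-16 only).
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def looks_like_xml(head: bytes) -> bool:
    """Return True if the leading bytes look like the start of an XML document."""
    encoding = "utf-8"
    for bom, name in _BOMS:
        if head.startswith(bom):
            # Drop the mark and decode the rest with the matching encoding.
            head, encoding = head[len(bom):], name
            break
    # errors="replace" keeps a truncated multi-byte character from raising.
    text = head.decode(encoding, errors="replace").lstrip()
    return text.startswith("<")


print(looks_like_xml(b'<?xml version="1.0"?><root/>'))
print(looks_like_xml(b"PK\x03\x04"))
```

If this returns False, the file can be handed to the zip reader instead. UTF-16 without a byte-order mark and UTF-32 are not covered by this sketch.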

Thomas Matthews