For example, Portable Executable has several, including the famous "MZ" at the beginning, as well as the "PE\0\0" at the start of the PE header. The Rar file format has the "Rar!" header at the beginning, and several others have similar "magic values" in the file.

What purpose do such magic values serve?

+6  A: 

Users rename files and change their extensions, and different programs sometimes claim the same extension. A magic value lets an application reject a file in an unknown format up front, instead of trying its best and then failing anyway.
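As a rough sketch of that early rejection (the `open_archive` entry point is hypothetical, and only the "Rar!" prefix mentioned in the question is checked, not the full signature):

```python
import io

RAR_MAGIC = b"Rar!"  # leading bytes of a RAR archive, as mentioned in the question


def has_magic(stream, magic):
    """Return True if the stream starts with `magic`; the position is restored."""
    start = stream.tell()
    head = stream.read(len(magic))
    stream.seek(start)
    return head == magic


def open_archive(stream):
    # Hypothetical entry point: refuse early rather than parse garbage.
    if not has_magic(stream, RAR_MAGIC):
        raise ValueError("not a RAR archive (bad magic)")
    return "parsing would start here"
```

A file that was merely renamed to `.rar` fails in `open_archive` after reading four bytes, before any real parsing is attempted.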

Ben Voigt
If it's going to fail anyway, then why does it need to detect the bad magic number? Presumably other portions of the file wouldn't make sense if it were a different file format.
Billy ONeal
@Billy - with some file formats, you can't necessarily tell if the data is 'bad'. For example, without a magic number, it would be pretty tough to programmatically determine whether a file was a bitmap.
Seth
Also then you wouldn't be able to differentiate a corrupt file in a known format from a (possibly) valid file in an unknown format.
Fabio Ceconello
Billy: Not true at all. I once wrote code to auto-detect and load one of 7 or 8 file types of some specific scientific data. Those formats without magic numbers made it almost impossible, since many were flexible enough that something in format X was actually also valid format Y. In the end, I loaded it in both and looked at the actual data sets to see which made more sense. (For example, if in format X the data-type marker indicated "% change from previous" data, and *all* the readings were above 1000%, it was probably some other format.) In general, though, that's not possible.
Ken
@Billy: The processing along the way before failure occurs could be very expensive. You mentioned rar, which is an archival format. Think of some other archival formats, like gzip. They are streaming, so you can't make a first pass over the file to see if everything is reasonable; you have to produce output on the first pass. Output might go to hard disk (or even slower storage media), and you might write multiple gigabytes of "decompressed" output before failing. The only way to prevent that is some sort of internal consistency check. Best would probably be a CRC initialized with the magic bytes.
Ben Voigt
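The "CRC initialized with the magic bytes" idea from the comment above could be sketched as follows. This is only an illustration: `MAGIC` and `seeded_crc` are made-up names, and CRC-32 from Python's `zlib` stands in for whatever checksum a real format would use.

```python
import zlib

MAGIC = b"DEMO"  # hypothetical magic value for an imaginary format


def seeded_crc(chunks, magic=MAGIC):
    """Running CRC-32 over a stream of chunks, seeded with the magic bytes."""
    crc = zlib.crc32(magic)
    for chunk in chunks:
        # zlib.crc32 accepts the previous value as a starting point,
        # so the stream never has to be held in memory at once.
        crc = zlib.crc32(chunk, crc)
    return crc
```

Because the seed depends on the magic, the same payload checksummed under a different magic yields a different CRC, so data from a different format fails the consistency check even if its payload bytes happen to match.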
+1  A: 

To quickly identify the type of a file, or to locate known positions within it (such as the start of a header).

Paul Butcher
+1  A: 

Your question should not be “why do file formats have magic numbers”, but rather “what are the advantages of file formats having magic numbers”!

Suggestions:

  • Programs that undelete files by scanning free disk space can recognize file types
  • Your UNIX system knows whether an executable file is to be interpreted (shebang) or is a binary
  • When you lose extensions, programs like file can detect what your files are
  • Designers of file formats consider it safer when applications can easily check that they are reading a file in the right format.
  • Since you have a header anyway, it does not cost much to put a magic number at the start of it.
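A toy version of what a tool like file does for the cases above might look like this. The table is a tiny illustrative subset, and `SIGNATURES` and `guess_type` are names invented for this sketch; real tools use far larger signature databases and check more than the first bytes.

```python
# (prefix, description) pairs; checked in order against the start of the data.
SIGNATURES = [
    (b"Rar!", "RAR archive"),
    (b"MZ", "DOS/PE executable"),
    (b"#!", "script with shebang line"),
    (b"\x1f\x8b", "gzip stream"),
]


def guess_type(head):
    """Guess a file type from its first bytes; 'unknown' if nothing matches."""
    for magic, name in SIGNATURES:
        if head.startswith(magic):
            return name
    return "unknown"
```

For example, `guess_type(b"#!/bin/sh\n")` matches the shebang entry regardless of what the file is called.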
Benoit
+3  A: 

The concept of magic numbers goes back to Unix and pre-dates the use of file extensions. The original idea of the shell was that all 'executables' would look the same: it didn't matter how the file had been created or what program should be used to evaluate it. The shell would look at the contents of the file and determine the appropriate program. Microsoft came along and chose a different approach, and the era of file extensions was born. Then, to make things 'nicer' for users, Microsoft chose to 'hide' these extensions, and the era of trojan files, which look like they are of one type but really have a different extension and are processed by a different program, was born.

+1  A: 

If two applications store data differently, but are constructed such that a file for one might possibly also be a valid (but meaningless) file for the other, very bad things can happen. A program may think it has successfully loaded the file (unaware that the data is meaningless) and then write back a file which to it would be semantically identical, but which would no longer be meaningfully readable by the application that wrote it (or anything else for that matter).

Using magic numbers doesn't entirely prevent this, but it can help at least somewhat.

BTW, trying to guess about the format of data is often very dangerous. For example, suppose one has a list of what are probably dates in the format nn-nn-nn. If one doesn't know what format the dates are in, there may be enough information to pretty well guess the format (e.g. if one of the records is 12-31-99, then absent information to the contrary, the dates are probably mm-dd-yy) but if all dates are within the first 12 days of a month, the data could easily be misinterpreted. Suppose, though, the data were preceded by something saying "MM-DD-YY". Then the risks of misinterpretation could be reduced.
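To make the date example concrete, here is a small sketch in which the first line of the data names the format, so nothing has to be guessed. The header strings and the `parse_dates` helper are assumptions made up for this illustration.

```python
from datetime import datetime

# Hypothetical header values mapped to strptime patterns.
HEADER_TO_STRPTIME = {
    "MM-DD-YY": "%m-%d-%y",
    "DD-MM-YY": "%d-%m-%y",
    "YY-MM-DD": "%y-%m-%d",
}


def parse_dates(lines):
    """Parse records using the format declared on the first line."""
    header, *records = lines
    fmt = HEADER_TO_STRPTIME[header]  # KeyError for an undeclared format
    return [datetime.strptime(record, fmt).date() for record in records]
```

The record 03-04-05 parses without error under all three headers and gives a different date each time, which is exactly the ambiguity the header removes.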

supercat