I've been doing some reading on file formats and I'm very interested in them. I'm wondering what the process is to create a format. For example, a .jpeg, or .gif, or an audio format. What programming language would you use (if you use a programming language at all)?

The site warned me that this question might be closed, but that's just a risk I'll take in the pursuit of knowledge. :)

+22  A: 

what the process is to create a format. For example, a .jpeg, or .gif, or an audio format.

Step 1. Decide what data is going to be in the file.

Step 2. Design how to represent that data in the file.

Step 3. Write it down so other people can understand it.

That's it. A file format is just an idea. Properly, it's an "agreement". Nothing more.
Everyone agrees to put the given information in the given format.

What programming language would you use (if you use a programming language at all)?

All programming languages that can do I/O can have file formats. Some have limitations on which file formats they can handle. Some languages don't handle low-level bytes as well as others.

But a "format" is not an "implementation".

The format is a concept. The implementation is -- well -- an implementation.

S.Lott
Great explanation. I like the "agreement" word, fits perfectly.
PeterK
+3  A: 

You do not need a programming language to write the specification for a file format, although a word processor might prove to be a handy tool.

Basically, you need to decide how the information in the file is to be stored as a sequence of bits. This might be trivial, or it might be exceedingly difficult. As a trivial example, a very primitive bitmap image format could start with one unsigned 32-bit integer representing the width of the bitmap, followed by one more such integer representing the height. Then you could decide to simply write out the colours of the pixels sequentially, left-to-right and top-to-bottom (row 1 of pixels, row 2 of pixels, ...), using 24 bits per pixel, in the form 8 bits for red + 8 bits for green + 8 bits for blue. For instance, an 8×8 bitmap consisting of alternating blue and red pixels would be stored as (in hexadecimal)

00000008000000080000FFFF00000000FFFF0000...
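The primitive bitmap layout described above can be sketched in a few lines of Python. This is only an illustration: the big-endian byte order is an assumption read off the hex dump, and the names encode_bitmap and decode_bitmap are made up for this sketch.

```python
import struct

def encode_bitmap(width, height, pixels):
    """Encode row-major (r, g, b) tuples: two 32-bit sizes, then 3 bytes per pixel."""
    if len(pixels) != width * height:
        raise ValueError("pixel count does not match width * height")
    # ">II" = two unsigned 32-bit integers, big-endian (assumed byte order).
    out = bytearray(struct.pack(">II", width, height))
    for r, g, b in pixels:
        out += bytes((r, g, b))
    return bytes(out)

def decode_bitmap(data):
    """Inverse of encode_bitmap; returns (width, height, pixels)."""
    width, height = struct.unpack_from(">II", data, 0)
    body = data[8:]
    if len(body) != width * height * 3:
        raise ValueError("pixel data has the wrong length")
    pixels = [tuple(body[i:i + 3]) for i in range(0, len(body), 3)]
    return width, height, pixels

BLUE = (0, 0, 255)
RED = (255, 0, 0)

# 64 pixels alternating blue and red in storage order.
pixels = [BLUE if i % 2 == 0 else RED for i in range(64)]
encoded = encode_bitmap(8, 8, pixels)

# The first 20 bytes are the header plus the first four pixels.
print(encoded[:20].hex().upper())
# prints 00000008000000080000FFFF00000000FFFF0000
```

Note that nothing in the file says "this is a bitmap" or which byte order was used. That knowledge lives only in the written description of the format, which is why the description matters.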

In a less trivial example, it really depends on the data you wish to save. Typically you would define a lot of records/structures, such as BITMAPINFOHEADER, and specify in what order they should come and how they should be nested, and you might need to write a lot of indices and look-up tables. I have written quite a few file formats myself, most recently the ASD (AlgoSim Data) file format used to save AlgoSim structures. Such files consist of a number of records (possibly nested), look-up tables, magic words (marking where a structure begins, where it ends, etc.) and strings in a custom-defined format. One thing that often simplifies a file format is to have the records contain data about their own size, and about the sizes of the custom data parts following the record (in case the record is some sort of header preceding data in a custom format, e.g. pixel colours or sound samples).
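To make the idea of size-carrying records concrete, here is a minimal sketch in Python. It is not the ASD format or any real format: the 4-byte tags (NAME, XTRA, SMPL), the big-endian 32-bit length field, and the function names are all assumptions invented for the example.

```python
import io
import struct

def write_record(stream, tag, payload):
    """Write one record: 4-byte tag, 32-bit payload length, then the payload."""
    if len(tag) != 4:
        raise ValueError("tag must be exactly 4 bytes")
    stream.write(tag)
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)

def read_records(stream):
    """Yield (tag, payload) pairs until the stream is exhausted."""
    while True:
        header = stream.read(8)
        if not header:
            return  # clean end of file
        if len(header) != 8:
            raise ValueError("truncated record header")
        tag = header[:4]
        (size,) = struct.unpack(">I", header[4:])
        payload = stream.read(size)
        if len(payload) != size:
            raise ValueError("truncated record payload")
        yield tag, payload

buf = io.BytesIO()
write_record(buf, b"NAME", "demo".encode("utf-8"))
write_record(buf, b"XTRA", b"\x01\x02\x03")  # a record this reader does not know
write_record(buf, b"SMPL", struct.pack(">3h", 100, -200, 300))
buf.seek(0)

# Because every record states its own size, unknown ones can simply be skipped.
known = {b"NAME", b"SMPL"}
records = [(tag, payload) for tag, payload in read_records(buf) if tag in known]
print([tag for tag, _ in records])
```

The reader never needs to understand the XTRA payload to get past it, which is the simplification the size fields buy you.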

If you haven't worked with file formats before, I would suggest that you learn a very simple format, such as the Windows 3 Bitmap format, and write your own BMP encoder/decoder, i.e. programs that create and read BMP files (from scratch) and display the BMP files they have read. Then you will know the basic ideas.

Andreas Rejbrand
+3  A: 

Fundamentally, files only exist to store information that needs to be loaded back in the future, either by the same program or a different one. A really good file format is designed so that:

  1. Any programming language can be used to read or write it.
  2. The information a program would most likely need from the file can be accessed quickly and efficiently.
  3. The format can be extended and expanded in the future, without breaking backwards compatibility.
  4. The format accommodates any special requirements (e.g. error resiliency, compression, encoding) present in the domain in which the file will be used.
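As one possible illustration of point 3, a header can state its own version and total size, so that an older reader can step over fields added later. The sketch below is in Python and makes several assumptions: the magic word DEMO, the field layout, and the function names are all invented for the example.

```python
import struct

MAGIC = b"DEMO"  # hypothetical magic word identifying the format
# Fixed prefix: magic, format version, total header size in bytes (big-endian).
PREFIX = struct.Struct(">4sHH")

def build_file(version, fields, body):
    """Assemble prefix + version-specific header fields + body."""
    header_size = PREFIX.size + len(fields)
    return PREFIX.pack(MAGIC, version, header_size) + fields + body

def read_as_v1(data):
    """A version-1 reader: knows only width and height, skips anything newer."""
    magic, version, header_size = PREFIX.unpack_from(data, 0)
    if magic != MAGIC:
        raise ValueError("not a DEMO file")
    if header_size < PREFIX.size + 8 or header_size > len(data):
        raise ValueError("implausible header size")
    width, height = struct.unpack_from(">II", data, PREFIX.size)
    # Jump to the stated end of the header rather than assuming its length.
    return width, height, data[header_size:]

# Version 1 stores width and height; version 2 appends a resolution field.
v1_file = build_file(1, struct.pack(">II", 2, 1), b"pixels")
v2_file = build_file(2, struct.pack(">III", 2, 1, 300), b"pixels")

print(read_as_v1(v1_file))
print(read_as_v1(v2_file))
```

Both calls return the same width, height and body, because the old reader trusts the header size stored in the file instead of a length hard-coded into the program. New fields can only ever be appended under this scheme, which is the price of the compatibility.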
Richard Walters
+1  A: 

You will most certainly be interested in looking into Protocol Buffers and Thrift. These tools provide a modern, principled way of designing forward- and backward-compatible file formats.

rodrigob
I wouldn't do that. But the OP's mileage may vary.
Andreas Rejbrand
A: 

Thanks for the answers, guys. Very informative and helpful.

orbit82