When designing a file format for recording binary data, what attributes do you think the format should have? So far, I've come up with the following important points:

  • have some "magic bytes" at the beginning, to be able to recognize the files (in my specific case, this should also help to distinguish the files from "legacy" files)
  • have a file version number at the beginning, so that the file format can be changed later without breaking compatibility
  • specify the endianness and size of all data items; or: include some space to describe endianness/size of data (I would tend towards the former)
  • possibly reserve some space for further per-file attributes that might be necessary in the future?

What else would be useful to make the format more future-proof and minimize headache in the future?

+7  A: 

Take a look at the PNG spec. This format has some very good rationale behind it.

Also, decide what's important for your future format: compactness, compatibility, the ability to embed other formats (e.g. different compression algorithms) inside it. Another interesting example is Google's protocol buffers, where the size of the transferred data is king.

As for endianness, I'd suggest picking one option and sticking with it rather than allowing different byte orders. Otherwise, reading and writing libraries will only get more complex and slower.
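
As a rough sketch of this in Python (the `MYFT` magic and the field layout are made up for illustration), fixing the byte order means always passing an explicit `"<"` (little-endian) prefix to `struct`, so the host's native order never leaks into the file:

```python
import struct

# Hypothetical header: 4-byte magic, 2-byte version, 8-byte double,
# always written little-endian ("<") regardless of host byte order.
MAGIC = b"MYFT"
HEADER = struct.Struct("<4sHd")

def write_header(version, value):
    """Serialize the header fields into their fixed on-disk layout."""
    return HEADER.pack(MAGIC, version, value)

def read_header(buf):
    """Parse the header, rejecting files without the magic bytes."""
    magic, version, value = HEADER.unpack_from(buf)
    if magic != MAGIC:
        raise ValueError("not a MYFT file")
    return version, value
```

Because the format string is explicit, the same code round-trips identically on big- and little-endian hosts, with no indicator byte needed in the file.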

Stepan Stolyarov
PNG is one example. Others with a similar structure are IFF (Interchange File Format, mainly used on the Commodore Amiga) and RIFF (for example WAV or AVI).
BlaM
+3  A: 

I would consider defining a substructure that higher levels use to store data, a little like a mini file system inside the file.

For example, even though your file format is going to store application-specific data, I would consider defining records / streams etc. inside the file in such a way that application-agnostic code is able to understand the layout of the file, but not of course understand the opaque payloads.

Let's get a little more concrete. Consider the usual ways of storing data in memory: generally they can be boiled down to contiguous expandable arrays/lists, pointer/reference-based graphs, or binary blobs of data in particular formats.

Thus, it may be fruitful to define the binary file format along similar lines. Use record headers which indicate the length and composition of the following data, whether it's in the form of an array (a list of identically-typed records), references (offsets to other records in the file), or data blobs (e.g. string data in a particular encoding, but not containing any references).

If carefully designed, this can permit the file format to be used not just for persisting data in and out all in one go, but on an incremental, as-needed basis. If the substructure is properly designed, it can be application agnostic yet still permit e.g. a garbage collection application to be written, which understands the blobs, arrays and reference record types, and is able to trace through the file and eliminate unused records (i.e. records that are no longer pointed to).

That's just one idea. Other places to look for ideas are in general file system designs, or relational database physical storage strategies.

Of course, depending on your requirements, this may be overkill. You may simply be after a binary format for persisting in-memory data, in which case an approach to consider is tagged records.

In this approach, every piece of data is prefixed with a tag. The tag indicates the type of the immediately following data, and possibly its length and name. Lists may be suffixed with an "end-list" tag that has no payload. The tag may have an embedded identifier, so tags that aren't understood can be ignored by the serialization mechanism when it's reading things in. It's a bit like XML in this respect, except using binary idioms instead.

Actually, XML is a good place to look for ideas on the long-term longevity of a file format. Look at its namespacing capabilities. If you construct your reading and writing code carefully, it ought to be possible to write applications that preserve the location and content of tagged data they don't understand (recursively), possibly because it was written by a later version of the same application.

Barry Kelly
+1  A: 

One way to future-proof the file would be to provide for blocks. Straight after your file header, you can begin the first block. Each block could have a byte or word code for the block type, then a size in bytes. Now you can arbitrarily add new block types, and readers can skip to the end of any block they don't understand.
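
A minimal sketch of such a reader in Python (the 1-byte type / 4-byte little-endian size header and the set of known types are just assumptions for illustration):

```python
import struct

KNOWN_TYPES = {1, 2}  # block types this version of the reader understands

def read_blocks(data):
    """Yield (block_type, payload) pairs, skipping unknown block types.

    Each block is assumed to start with a 1-byte type code and a 4-byte
    little-endian payload length, followed by the payload itself.
    """
    offset = 0
    while offset < len(data):
        btype, length = struct.unpack_from("<BI", data, offset)
        offset += 5  # advance past the 5-byte block header
        if btype in KNOWN_TYPES:
            yield btype, data[offset:offset + length]
        offset += length  # skip the payload whether or not we understood it
```

The key point is that the size field lets the reader advance past blocks it doesn't recognize, so old readers keep working on files written by newer software.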

Bork Blatt
+5  A: 

It all depends on the purpose of the format, of course.

One flexible approach is to structure the entire file as TLV (Tag-Length-Value) triplets. For example, make your file comprise records, each record beginning with a 4-byte header:

1 byte  = record type
3 bytes = record length
followed by record content
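
That 4-byte header can be packed and parsed in a few lines of Python (the helper names are mine; the layout is exactly the 1-byte type plus 3-byte length described above, here with the length big-endian):

```python
def pack_tlv(rtype, payload):
    """Pack one record: 1-byte type + 3-byte big-endian length + payload."""
    if len(payload) >= 1 << 24:
        raise ValueError("payload too large for a 3-byte length field")
    return bytes([rtype]) + len(payload).to_bytes(3, "big") + payload

def unpack_tlv(data, offset=0):
    """Return (type, payload, next_offset) for the record at offset."""
    rtype = data[offset]
    length = int.from_bytes(data[offset + 1:offset + 4], "big")
    start = offset + 4
    return rtype, data[start:start + length], start + length
```

Returning the next offset from the unpacker makes it trivial to walk a whole file record by record.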

Regarding endianness: if you store an endianness indicator in the file, all your applications will have to support all endianness formats. On the other hand, if you specify a particular endianness for your files, only applications on platforms with non-matching endianness will have to do additional work, and that can be decided at compile time (using conditional compilation).

atzz
TLV seems to be everywhere, so much so that I fail to see its point.
Cheery
+1  A: 

I agree with atzz's suggestion of using a Tag Length Value system. For future compatibility, you could store a set of "pointers" to TLV entries at the start (or maybe Tag,Pointer and have the pointer point to a Length,Value; or perhaps Tag,Length,Pointer and then have all the data together elsewhere?).

So, my file could look something like:

magic number/file id
version
tag for first data entry
pointer to first data entry --------+
tag for second data entry           |
pointer to second data entry        |
...                                 |
length of first data entry <--------+
value for first data entry
...

magic number, version, tags, pointers and lengths would all have predefined, fixed lengths, for easy decoding. Say, 2 bytes each. Or 4, depending on what you need. They don't all need to be the same (e.g., all tags are 1 byte, pointers are 4, etc.).

The tag lets you know what is being stored. The pointer tells you where (either an offset or absolute value, in bytes), the length tells you how large the data is, and the value is length bytes of data of type tag. If you use a MyFileFormat v1 decoder on a MyFileFormat v2 file, the pointers allow you to skip sections which the v1 decoder doesn't understand. If you simply skip invalid tags, you can probably simply use TLV instead of TPLV.
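
To make that layout concrete, here's a minimal Python sketch (2-byte tags, 4-byte pointers and lengths, little-endian; the sizes and helper names are illustrative choices, not part of any spec):

```python
import struct

def pack_tplv(entries):
    """Pack (tag, value) pairs as a tag/pointer table followed by
    length-prefixed values, matching the layout sketched above.
    Pointers are absolute byte offsets from the start of the output."""
    table_size = len(entries) * 6          # 2-byte tag + 4-byte pointer each
    table, body = b"", b""
    offset = table_size                    # first value starts after the table
    for tag, value in entries:
        table += struct.pack("<HI", tag, offset)
        body += struct.pack("<I", len(value)) + value
        offset += 4 + len(value)
    return table + body

def read_entry(data, index):
    """Read entry number `index` via its pointer, without parsing the rest."""
    tag, ptr = struct.unpack_from("<HI", data, index * 6)
    (length,) = struct.unpack_from("<I", data, ptr)
    return tag, data[ptr + 4:ptr + 4 + length]
```

Note how `read_entry` jumps straight to any entry through the pointer table; a decoder that doesn't recognize a tag simply never follows its pointer.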

I would either hand code something like that, or maybe define my format in ASN.1 and generate a codec (I work in telecommunications, so ASN.1/TLV makes sense to me :-D)

Dan
A: 

Make sure that you reserve a tag code (or, better yet, reserve a bit in each tag) that marks a deleted/free block/chunk. Blocks can then be deleted by simply changing a block's tag code to the deleted tag code, or by setting the tag's deleted bit. This way you don't need to completely restructure the file right away when you delete a block.

Reserving a bit in the tag provides the option of undeleting the block later (if you leave the block's data unchanged).

For security, however, you might want to zero out the deleted block's data; in this case you would use a special deleted/free tag.
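
A quick sketch of the deleted-bit idea in Python (assuming a record header of a 1-byte tag plus a 3-byte big-endian length; the high bit of the tag is my arbitrary choice for the deleted flag):

```python
DELETED_BIT = 0x80  # hypothetical: high bit of the 1-byte tag marks deletion

def delete_record(data, offset, wipe=False):
    """Mark the record at `offset` deleted by setting the tag's deleted bit.

    With wipe=True, also zero the payload for security; without it, the
    record can later be undeleted by clearing the bit again.
    """
    buf = bytearray(data)
    buf[offset] |= DELETED_BIT
    if wipe:
        length = int.from_bytes(buf[offset + 1:offset + 4], "big")
        buf[offset + 4:offset + 4 + length] = bytes(length)
    return bytes(buf)

def is_deleted(data, offset):
    return bool(data[offset] & DELETED_BIT)
```

Readers then treat any record whose deleted bit is set exactly like an unknown tag: read the length, skip the payload.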

I agree with Stepan that you should choose an endianness, but I would also have an endianness indicator in the file. If you use an endianness indicator, you might consider using one of the Unicode Byte Order Marks, which would also indicate the Unicode text encoding used for any text blocks. The BOM is usually the first few bytes of Unicode text files, so if your BOM is the first entry in your file, some utility might misidentify your file as Unicode text (I don't think this is much of an issue). I would treat/reserve the BOM as one of your normal tags (using the UTF-16 BOM if you use 16-bit tags, or the UTF-32 BOM if you use 32-bit tags) with a zero-length block/chunk.

See also http://en.wikipedia.org/wiki/File_format

Roger Nelson
+2  A: 

I agree that these are good ideas:

  1. Magic numbers at the beginning. Pretty much required on *nix, where tools like file(1) identify files by them.

  2. File version number for backwards compatibility.

  3. Endianness specification.

But your fourth point:

  • possibly reserve some space for further per-file attributes that might be necessary in the future?

is overkill, because #2 lets you add fields as long as you change the version number (and as long as you don't need forward compatibility).

Also, the idea of imposing a block-structure on your file, expressed in many other answers, seems less like a universal requirement for binary files than a solution to a problem with certain kinds of payloads.

In addition to 1-3 above, I'd add these:

  • a simple checksum or other way of detecting that the contents are intact. Otherwise you can't trust the magic bytes or version numbers. Be careful to specify which bytes are included in the checksum. Typically you would include all bytes in the file that don't already have error detection.

  • the version of your software (including the most granular number you have, e.g. the build number) that wrote the file. You're going to get a bug report with an attached file from someone who can't open it, and they will have no clue when they wrote the file, because the error didn't occur then. But the bug is in the version that wrote it, not in the one trying to read it.

  • Make it clear in the spec that this is a binary format, i.e. all values 0-255 are allowed for all bytes (except the magic numbers).
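
The checksum from the first bullet can be sketched in a few lines of Python using CRC-32 (the 4-byte little-endian trailer layout is just one reasonable choice):

```python
import struct
import zlib

def append_checksum(body):
    """Append a CRC-32 of everything before it as a 4-byte trailer.

    The checksum covers the whole file body, including the magic bytes
    and version, so corruption anywhere in the file is detected.
    """
    return body + struct.pack("<I", zlib.crc32(body))

def verify_checksum(data):
    """Recompute the CRC over the body and compare to the stored trailer."""
    body, (stored,) = data[:-4], struct.unpack("<I", data[-4:])
    return zlib.crc32(body) == stored
```

CRC-32 detects all single-byte corruptions and short bursts; if you need to detect deliberate tampering rather than accidental damage, a cryptographic hash would be the stronger choice.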

And here are some optional ones:

  • If you do need forward compatibility, you need some way of expressing which "chunks" are "optional" (like PNG does), so that a previous version of your software can skip over them gracefully.

  • If you expect these files to be found "in the wild", you might consider embedding some clue to find the spec. Imagine how helpful it would be to find the string http://www.w3.org/TR/PNG/ in a PNG file.

Bart