I have a program that reads and writes a binary file. A file is interchangeable between executions of the program on the same platform, but a file produced on one platform may not be valid on another because of differences in type sizes, endianness, etc.

I want a quick way to assert that a given file is valid for reading on a given architecture. I am not interested in making the file cross-architecture (in fact the file consists of memory-mapped structs). I only want a way of checking, before reading it, that the file was created on an architecture with the same type sizes, etc.

One idea is to write a struct of constant magic numbers at the start of the file, which can then be read and verified. Another would be to store the sizeof of various types as single-byte integers.

This is for C but I suppose the question is language-agnostic for languages with the same kinds of issues.

What's the best way to do this?

I welcome amendments for the title of this question!

+2  A: 

I like the magic number at the start of the file idea. You get to make these checks with a magic value:

  • If there are at least two magic bytes and you treat them as a single multi-byte integer, you can detect endianness changes. For instance, if you choose 0xABCD, and your code reads 0xCDAB, you're on a platform with different endianness than the one where the file was written.

  • If you use a 4- or 8-byte integer, you can detect 32- vs. 64-bit platforms, if you choose your data type so it's a different size on the two platforms.

  • If there is more than just an integer or you choose it carefully, you can rule out the possibility of accidentally reading a file written out by another program to a high degree of probability. See /etc/magic on any Unixy type system for a good list of values to avoid.
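A minimal sketch of the endianness check described above. The magic value `FILE_MAGIC` and the names `check_magic` and `swap32` are made up for illustration; it assumes the writer stored the magic as a native 32-bit integer in the first four bytes of the file.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical magic value; choose your own and check it against /etc/magic. */
#define FILE_MAGIC UINT32_C(0xABCD1234)

enum magic_result { MAGIC_OK, MAGIC_SWAPPED, MAGIC_UNKNOWN };

/* Reverse the byte order of a 32-bit value. */
static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Classify the first four bytes of a file, interpreted in native byte order. */
static enum magic_result check_magic(const unsigned char *buf)
{
    uint32_t m;

    assert(buf != NULL);
    memcpy(&m, buf, sizeof m);            /* memcpy avoids unaligned reads */
    if (m == FILE_MAGIC)
        return MAGIC_OK;                  /* same byte order as the writer */
    if (m == swap32(FILE_MAGIC))
        return MAGIC_SWAPPED;             /* written with the opposite byte order */
    return MAGIC_UNKNOWN;                 /* probably not one of our files */
}
```

`MAGIC_SWAPPED` lets you give a more useful error message ("written on a machine with different endianness") than a plain "bad file".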

Warren Young
+1  A: 
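#include <stdint.h>

union header {
     uint8_t a[8];
     uint64_t u;
};

/* Each sizeof is cast to uint64_t before shifting, because size_t may be
   only 32 bits wide and shifting it by 32 or more would be undefined. */
const union header h = { .u = ((uint64_t)sizeof(      short  ) <<  0 )
                            | ((uint64_t)sizeof(        int  ) <<  8 )
                            | ((uint64_t)sizeof(       long  ) << 16 )
                            | ((uint64_t)sizeof(   long long ) << 24 )
                            | ((uint64_t)sizeof(       float ) << 32 )
                            | ((uint64_t)sizeof(      double ) << 40 )
                            | ((uint64_t)sizeof( long double ) << 48 )
                            | 0 } ;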

This should be enough to verify the type sizes and endianness, except that floating point numbers are crazy difficult for this.
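On the reading side, a sketch of the check could look like the following. The function names are hypothetical, and it assumes the 8-byte value was written in the writer's native byte order as the very first thing in the file.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack the sizeof of each basic type into one byte of a 64-bit value,
   using the same layout as the header above. */
static uint64_t native_type_sizes(void)
{
    return ((uint64_t)sizeof(short)       <<  0)
         | ((uint64_t)sizeof(int)         <<  8)
         | ((uint64_t)sizeof(long)        << 16)
         | ((uint64_t)sizeof(long long)   << 24)
         | ((uint64_t)sizeof(float)       << 32)
         | ((uint64_t)sizeof(double)      << 40)
         | ((uint64_t)sizeof(long double) << 48);
}

/* buf points at the first 8 bytes of the file.
   Returns 1 if they match this platform, 0 otherwise. */
static int type_sizes_match(const unsigned char *buf)
{
    uint64_t stored;

    assert(buf != NULL);
    memcpy(&stored, buf, sizeof stored);
    /* A different type size or a different byte order both change the value. */
    return stored == native_type_sizes();
}
```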

If you want to verify that your floating point numbers are stored in the same format on the writer and the reader then you might want to store a couple of constant floating point numbers (more interesting than 0, 1, and -1) in the different sizes after this header, and verify that they are what you think they should be.
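Something along these lines might do for the floating point probe. The constants and names are arbitrary choices for illustration, and long double is left out to keep it short.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical probe constants with non-trivial bit patterns. */
#define FP_PROBE_F (-118.625f)
#define FP_PROBE_D (1.0 / 3.0)

/* Written right after the size header by the producer of the file. */
struct fp_probe {
    float  f;
    double d;
};

static void fp_probe_fill(struct fp_probe *p)
{
    assert(p != NULL);
    memset(p, 0, sizeof *p);   /* keep padding bytes deterministic in the file */
    p->f = FP_PROBE_F;
    p->d = FP_PROBE_D;
}

/* Returns 1 if the probe read from the file has the same byte patterns
   that this platform produces for the same constants. */
static int fp_probe_matches(const struct fp_probe *from_file)
{
    const float  f = FP_PROBE_F;
    const double d = FP_PROBE_D;

    assert(from_file != NULL);
    /* Compare the fields as raw bytes so the representation itself is
       checked, not just the numeric value. */
    return memcmp(&from_file->f, &f, sizeof f) == 0
        && memcmp(&from_file->d, &d, sizeof d) == 0;
}
```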

It is very likely that storing an actual magic string with version number would also be good as another check that this is the correct file format.

If you don't care about floats or something like that then feel free to delete them. I didn't include char because it is supposed to always be 1 byte.

It might also be a good idea to store the sizeof of some struct like:

struct misaligned {
    char c;
    uint64_t u;
};

This should let you easily determine the alignment and padding used by the compiler that built the program that generated the file. On most 32-bit machines that care about alignment, the sizeof would be 12 (96 bits), because there would be 3 bytes of padding between c and u; on a 64-bit machine it may be 16 (128 bits), with 7 bytes of padding between c and u. On an AVR the sizeof would most likely be 9, because there would be no padding.
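A rough sketch of how that could be recorded and checked. `struct layout_probe` and the function names are invented for this example; storing offsetof as well as sizeof is an extra I added, since it pins down where the padding sits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct misaligned {
    char     c;
    uint64_t u;
};

/* Hypothetical layout record. Single bytes, so it has no endianness itself. */
struct layout_probe {
    uint8_t size;      /* sizeof(struct misaligned) on the writer */
    uint8_t offset_u;  /* offsetof(struct misaligned, u) on the writer */
};

/* Describe how this compiler lays out struct misaligned. */
static struct layout_probe layout_probe_native(void)
{
    struct layout_probe p;

    assert(sizeof(struct misaligned) <= UINT8_MAX);  /* must fit in one byte */
    p.size     = (uint8_t)sizeof(struct misaligned);
    p.offset_u = (uint8_t)offsetof(struct misaligned, u);
    return p;
}

/* Returns 1 if the record read from the file matches this platform. */
static int layout_probe_matches(struct layout_probe from_file)
{
    struct layout_probe here = layout_probe_native();

    return from_file.size == here.size
        && from_file.offset_u == here.offset_u;
}
```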

NOTE

  • This answer relies on the question stating that the files are memory-mapped and that there is no need for portability beyond recognizing that a file is in the wrong format. If the question were about general file storage and retrieval, I would have answered differently. The biggest difference would be packing the data structures.
nategoose
A: 

First, I fully agree with the previous answer provided by Warren Young.

This is a meta-data case we're talking about.

For homogeneous content on a filesystem, I'd prefer a single meta-data block, padded to the size of a structure, at the beginning of the binary file. This preserves data structure alignment and simplifies appending.

If the content is heterogeneous, I'd prefer putting a Structure-Value or Structure-Length-Value header (also known as Type-Length-Value) in front of each item or range of data.
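To make the Type-Length-Value idea concrete, here is a small sketch. The `struct tlv_header` layout and both function names are assumptions of mine, and the header is stored in native byte order since the file is not meant to be portable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record header placed in front of every payload. */
struct tlv_header {
    uint32_t type;    /* application-defined record type */
    uint32_t length;  /* number of payload bytes that follow */
};

/* Append one record at offset off. Returns the offset just past the record,
   or 0 if it does not fit (a valid result is never 0). */
static size_t tlv_append(unsigned char *buf, size_t cap, size_t off,
                         uint32_t type, const void *payload, uint32_t length)
{
    struct tlv_header h = { type, length };

    assert(buf != NULL && payload != NULL);
    if (off > cap || cap - off < sizeof h || cap - off - sizeof h < length)
        return 0;
    memcpy(buf + off, &h, sizeof h);
    memcpy(buf + off + sizeof h, payload, length);
    return off + sizeof h + length;
}

/* Read the header at offset off. Returns a pointer to the payload,
   or NULL if the header or payload would run past len. */
static const unsigned char *tlv_read(const unsigned char *buf, size_t len,
                                     size_t off, struct tlv_header *out)
{
    assert(buf != NULL && out != NULL);
    if (off > len || len - off < sizeof *out)
        return NULL;
    memcpy(out, buf + off, sizeof *out);
    if (len - off - sizeof *out < out->length)
        return NULL;
    return buf + off + sizeof *out;
}
```

A reader can skip record types it does not understand by advancing `sizeof(struct tlv_header) + length` bytes.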

On a stream that readers may join at random points, you may wish to have some kind of structure sync, something like HDLC (on Wikipedia), and to repeat the meta-data during the constant flow of binary data. If you're familiar with audio/video formats, you may think of tags inside a data flow which is intrinsically composed of frames.

Nice subject!

levif
+1  A: 

Call the uname(2) function (or equivalent on non-POSIX platforms) and write the sysname and machine fields from the struct utsname into a header at the start of the file.

(There's more to it than just sizes and endianness: floating point formats and structure padding rules vary too. So it's really the machine ABI that you want to assert is the same.)
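A sketch of what that might look like on a POSIX system. `struct platform_tag` and its 32-byte field sizes are assumptions for the example; longer names are silently truncated.

```c
#include <assert.h>
#include <string.h>
#include <sys/utsname.h>

/* Hypothetical header stored at the start of the file. */
struct platform_tag {
    char sysname[32];
    char machine[32];
};

/* Fill tag from uname(2). Returns 0 on success, -1 on failure. */
static int platform_tag_fill(struct platform_tag *tag)
{
    struct utsname u;

    assert(tag != NULL);
    if (uname(&u) < 0)
        return -1;
    memset(tag, 0, sizeof *tag);   /* zero fill, so unused bytes compare equal */
    strncpy(tag->sysname, u.sysname, sizeof tag->sysname - 1);
    strncpy(tag->machine, u.machine, sizeof tag->machine - 1);
    return 0;
}

/* Returns 1 if the tag read from the file matches the running system. */
static int platform_tag_matches(const struct platform_tag *from_file)
{
    struct platform_tag here;

    assert(from_file != NULL);
    if (platform_tag_fill(&here) != 0)
        return 0;
    return memcmp(&here, from_file, sizeof here) == 0;
}
```

Note that the machine field does not capture everything about the ABI; for example, a 32-bit program running on a 64-bit kernel typically still sees the 64-bit machine name.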

caf