I have a simulation that reads large binary data files that we create (10s to 100s of GB). We use binary for speed reasons. These files are system dependent, converted from text files on each system that we run, so I'm not concerned about portability. The files currently are many instances of a POD struct, written with fwrite.

I need to change the struct, so I want to add a header that has a file version number in it, which will be incremented anytime the struct changes. Since I'm doing this, I want to add some other information as well. I'm thinking of the size of the struct, byte order, and maybe the svn version number of the code that created the binary file. Is there anything else that would be useful to add?
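
To make this concrete, here is a rough sketch of the kind of header I have in mind, written once before the records. The struct and field names are placeholders, not our real layout:

// Sketch only: SimRecord stands in for our real POD struct.
#include <cstdio>
#include <cstdint>
#include <cstddef>

struct FileHeader
{
  char     magic[8];        // identifies the file type, e.g. "SIMDAT"
  uint32_t formatVersion;   // incremented whenever SimRecord changes
  uint32_t recordSize;      // sizeof(SimRecord) as written
  uint32_t byteOrderMark;   // e.g. 0x01020304, lets a reader detect byte order
  uint32_t creatorRevision; // svn revision of the code that wrote the file
};

struct SimRecord              // placeholder for the real simulation struct
{
  double x, y, z;
};

bool write_file(const char* path, const SimRecord* recs, size_t n)
{
  FILE* f = std::fopen(path, "wb");
  if (!f) return false;
  FileHeader h = { "SIMDAT", 1, sizeof(SimRecord), 0x01020304u, 12345u };  // 12345 = made-up svn revision
  bool ok = std::fwrite(&h, sizeof h, 1, f) == 1
         && std::fwrite(recs, sizeof(SimRecord), n, f) == n;
  std::fclose(f);
  return ok;
}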

+3  A: 

An identifier for the type of the file would be useful if you will have other structures written to binary files later on. This could be a short string, so that a quick look at the file in a hex editor tells you what it contains.

rstevens
A: 

If you put a version number in the header, you can bump it whenever you need to change the POD struct or add new fields to the header.

So don't add things to the header now just because they might be interesting later. You would only be creating code that you have to maintain but that has little real value.

ewalshe
+1  A: 

In addition to whatever information you need for schema versioning, add details that may be of value if you are troubleshooting an issue. For example:

  • timestamps of when the file was created and last updated (if applicable).
  • the version string from the build (ideally you have a version string that is auto-incremented on every 'official' build ... this is different from the file schema version).
  • the name of the system creating the file, and maybe other statistics that are relevant to your app

We find this very useful (a) in getting information we would otherwise have to ask the customer to provide and (b) in getting correct information -- it is amazing how many customers report they are running a different version of the software from what the data claims!
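
A rough sketch of what those fields might look like as a fixed-size block in the header (names and sizes here are illustrative, not a recommendation of exact values):

#include <cstdint>

// Fixed-size character arrays keep this a POD that can be fwritten in one call.
struct TroubleshootingInfo
{
  int64_t createdUtc;        // when the file was first written (e.g. from time())
  int64_t updatedUtc;        // last modification, if the file is ever appended to
  char    buildVersion[32];  // auto-incremented build string, e.g. "2.4.0.1873"
  char    hostName[64];      // name of the system that produced the file
  char    reserved[64];      // room for future diagnostic fields
};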

Rob Walker
+3  A: 

For large binaries, in addition to the version number I tend to put a record count and CRC, the reason being that large binaries are much more prone to get truncated and/or corrupted over time or during transfer than smaller ones. I found recently to my horror that Windows does not handle this well at all, as I used explorer to copy about 2TB across a couple of hundred files to an attached NAS device, and found 2-3 files on each copy were damaged (not completely copied).
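
As a sketch of how the count and checksum might be carried (field names are illustrative; the CRC routine below is a plain bitwise CRC-32, fine for an occasional offline check but slower than a table-driven or library implementation):

#include <cstdint>
#include <cstddef>

struct IntegrityInfo
{
  uint64_t recordCount;  // how many records the file should contain
  uint32_t dataCrc32;    // CRC over the record payload, verified by a standalone utility
};

// Minimal bitwise CRC-32 (reflected, polynomial 0xEDB88320).
uint32_t crc32(const unsigned char* data, size_t len)
{
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < len; ++i)
  {
    crc ^= data[i];
    for (int b = 0; b < 8; ++b)
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
  }
  return crc ^ 0xFFFFFFFFu;
}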

Shane MacLaughlin
A record count is a good idea as another check. The data isn't moved much, so a CRC might be overkill. On the other hand, it would be easy to calculate when the file is written, and it need not be checked every time the file is read. We could have a standalone utility for that.
KeithB
My technique (see below) of storing metadata at the file END is a quick catch for this problem.
Roddy
I tend to store metadata at the start rather than at the end because in my experience the end of the file is more liable to be lost. If this is the case, the opportunities for partially recovering corrupt data are improved with the metadata at the start.
Shane MacLaughlin
Additionally, placing metadata at the end of the file is likely to make appending to a file either slower or more complex.
Shane MacLaughlin
+1  A: 

@rstevens said 'an identifier for the type of file'...sound advice. Conventionally, that's called a magic number and, in a file, it isn't a term of abuse (unlike in code, where it is). Basically, it is some number - typically at least 4 bytes, and I usually ensure that at least one of those bytes is not ASCII - that you can use to validate that the file is of the type you expect, with a low probability of confusion. You can also write a rule in /etc/magic (or the local equivalent) so that the file command reports files containing your magic number as your special file type.

You should include a file format version number. However, I would recommend not using the SVN number of the code. Your code may change when the file format does not.
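
As a sketch of the check this enables at open time (the magic value and names below are made up for illustration):

#include <cstdio>
#include <cstdint>

// 0x89 keeps one byte outside the ASCII range, so a text file can never match.
const uint32_t kMagic = 0x894D5946u;

struct VersionedHeader
{
  uint32_t magic;          // must equal kMagic
  uint32_t formatVersion;  // bumped whenever the record struct changes
};

// Returns the file format version, or -1 if this is not one of our files.
int check_header(FILE* f)
{
  VersionedHeader h;
  if (std::fread(&h, sizeof h, 1, f) != 1) return -1;
  if (h.magic != kMagic) return -1;
  return static_cast<int>(h.formatVersion);
}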

Jonathan Leffler
I was going to include both file format version and SVN version of the file creator, so that if we later find a bug in certain versions of the creator, it is easy to figure out which data files are affected.
KeithB
OK - that works too (in fact, it is quite a good idea). But the key point is that there is a version number for the file format which is separate from the version number of the code that wrote the file in the given format.
Jonathan Leffler
That's correct. Thanks for making that point clear.
KeithB
Magic number sounds good (better than my naive solution :-)
rstevens
A: 

For large files, you might want to add data definitions, so your file format becomes self-describing.
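
One way to do that, sketched with invented names, is to follow the header with an array of field descriptors, one per member of the record struct, so a reader can interpret the records without compiled-in knowledge of the layout:

#include <cstdint>

struct FieldDescriptor
{
  char     name[32];   // e.g. "velocity_x"
  uint8_t  type;       // e.g. 0 = int32, 1 = int64, 2 = float, 3 = double
  uint16_t offset;     // offsetof(Record, field)
  uint16_t size;       // sizeof the field
};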

Stephan Eggermont
This is a good idea in general. In our specific case, a database is the "official" version of the data. The binary files are only used to feed data into the simulation.
KeithB
+3  A: 

In my experience, second-guessing the data you'll need is invariably wasted time. What's important is to structure your metadata in a way that is extensible. For XML files, that's straightforward, but binary files require a bit more thought.

I tend to store metadata in a structure at the END of the file, not the beginning. This has two advantages:

  • Truncated/unterminated files are easily detected.
  • Metadata footers can often be appended to existing files without impacting their reading code.

The simplest metadata footer I use looks something like this:

struct MetadataFooter
{
  char creatorVersion[40];
  char creatorApplication[40];
  // ... or whatever else is useful
};

struct FileFooter
{
  int64_t metadataFooterSize;  // = sizeof(MetadataFooter)
  char magicString[10];        // a unique identifier for the format: maybe "MYFILEFMT"
};

After the raw data, the metadata footer and THEN the file footer are written.

When reading the file, seek to the end - sizeof(FileFooter). Read the footer, and verify the magicString. Then, seek back according to metadataFooterSize and read the metadata. Depending on the footer size contained in the file, you can use default values for missing fields.
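
A rough read-side sketch of that procedure, using the structs above (error handling and sanity checks trimmed):

#include <cstdio>
#include <cstring>
#include <cstddef>

bool read_metadata(FILE* f, MetadataFooter& meta)
{
  FileFooter footer;
  if (std::fseek(f, -(long)sizeof(FileFooter), SEEK_END) != 0) return false;
  if (std::fread(&footer, sizeof footer, 1, f) != 1) return false;
  if (std::strncmp(footer.magicString, "MYFILEFMT", sizeof footer.magicString) != 0)
    return false;  // truncated, or not one of our files

  // Older files may have written a smaller metadata footer, so default
  // everything first and only read as much as the file says is there.
  std::memset(&meta, 0, sizeof meta);
  long metaSize = (long)footer.metadataFooterSize;
  if (std::fseek(f, -(long)sizeof(FileFooter) - metaSize, SEEK_END) != 0) return false;
  size_t toRead = metaSize < (long)sizeof meta ? (size_t)metaSize : sizeof meta;
  return std::fread(&meta, 1, toRead, f) == toRead;
}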

As KeithB points out, you could even use this technique to store the metadata as an XML string, combining totally extensible metadata with the compactness and speed of binary data.

Roddy
This is an interesting approach I hadn't thought of. You could even make the MetadataFooter an XML string, and get all of the benefits of a binary data file while still having an easily extensible scheme for storing metadata.
KeithB
@KeithB : Ah! That's a technique I hadn't considered. I like that ;-)
Roddy
+3  A: 

Hi

For large binaries I'd look seriously at HDF5 (Google for it). Even if it's not something you want to adopt it might point you in some useful directions in designing your own formats.

Regards

Mark

High Performance Mark
I've heard about HDF5, but never had time to look into it. It seems to be the standard for scientific computing. It's probably overkill for what we need, since our data is just multiple copies of one struct. For something more complex, I would consider it.
KeithB
I've used it for simple stuff and it's very useful. (Though I do wish they had a simpler Java binding.) They give you standard tools, and MATLAB can parse it.
Jason S
+1  A: 

You might consider putting a file offset in a fixed position in the header, which tells you where the actual data begins in the file. This would let you change the size of the header when needed.

In a couple of cases, I put the value 0x12345678 into the header so I could detect whether the file's byte order matched the endianness of the machine that was processing it.
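
A sketch of both ideas combined (names are illustrative):

#include <cstdint>

struct FilePreamble
{
  uint32_t byteOrderMark;  // written as 0x12345678 by the producer
  uint64_t dataOffset;     // absolute offset where the records begin, so the
                           // header can grow without breaking old readers
};

// If the marker reads back as 0x78563412, the file was written on a machine
// with the opposite byte order and the fields need swapping.
bool needs_byte_swap(const FilePreamble& p)
{
  return p.byteOrderMark == 0x78563412u;
}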

EvilTeach
+2  A: 

If they're that large, I'd reserve a healthy chunk (64K?) of space at the beginning of the file and put the metadata there in XML format, followed by an end-of-file character (Ctrl-Z for DOS/Windows, Ctrl-D for Unix?). That way you can examine and parse the metadata easily with the wide range of toolsets out there for XML.
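
A rough sketch of that reservation scheme (the 64K size, function name, and Ctrl-Z placement are illustrative):

#include <cstdio>
#include <cstring>
#include <cstddef>

const size_t kHeaderReserve = 64 * 1024;  // fixed region reserved for XML metadata

// Write the XML, a Ctrl-Z terminator, then zero padding up to kHeaderReserve;
// the binary records can be fwritten immediately afterwards.
bool write_xml_header(FILE* f, const char* xml)
{
  size_t len = std::strlen(xml);
  if (len + 1 > kHeaderReserve) return false;       // metadata must fit the reserve
  if (std::fwrite(xml, 1, len, f) != len) return false;
  if (std::fputc(0x1A, f) == EOF) return false;     // Ctrl-Z end-of-text marker
  static const char zeros[4096] = {0};
  for (size_t written = len + 1; written < kHeaderReserve; )
  {
    size_t chunk = kHeaderReserve - written;
    if (chunk > sizeof zeros) chunk = sizeof zeros;
    if (std::fwrite(zeros, 1, chunk, f) != chunk) return false;
    written += chunk;
  }
  return true;
}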

Otherwise I go with what other people have already said: timestamp for file creation, identifier for which machine it's created on, basically anything else that you can think of for diagnostic purposes. And ideally you would include the definition of the structure format itself. If you are changing the structure often, it's a big pain to keep the proper version of the code around to read the various formats of old data files.

One big advantage of HDF5, as @highpercomp has mentioned, is that you just don't need to worry about changes in the structure format, as long as you have some convention for what the names and datatypes are. The structure names and datatypes are all stored in the file itself, so you can blow your C code to smithereens and it doesn't matter; you can still retrieve data from an HDF5 file. It lets you worry less about the format of the data and more about the structure of the data: I don't care about the sequence of bytes, that's HDF5's problem, but I do care about field names and the like.
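
For reference, a minimal sketch of how a struct's layout gets registered with HDF5's C API; the record type here is a toy, but the calls are the standard ones (link against the HDF5 library):

#include <hdf5.h>

struct SimRecord { double x, y, z; };  // stand-in for the real struct

// Describe the struct once; the field names and types then travel inside the file.
hid_t make_record_type()
{
  hid_t t = H5Tcreate(H5T_COMPOUND, sizeof(SimRecord));
  H5Tinsert(t, "x", HOFFSET(SimRecord, x), H5T_NATIVE_DOUBLE);
  H5Tinsert(t, "y", HOFFSET(SimRecord, y), H5T_NATIVE_DOUBLE);
  H5Tinsert(t, "z", HOFFSET(SimRecord, z), H5T_NATIVE_DOUBLE);
  return t;
}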

Another reason I like HDF5 is you can choose to use compression, which takes a very small amount of time and can give you huge wins in storage space if the data is slowly-changing or mostly the same except for a few errant blips of interestingness.

Jason S