views: 86
answers: 3

Data is often stored in program-specific binary files for which there is little or no documentation. A typical example in our field is data that comes from an instrument, but I suspect the problem is general. What methods are there for trying to understand and interpret the data?

To set some boundaries: the files are not encrypted and there is no DRM. The type and format of the file are specific to the writer of the program (i.e. it is not a "standard file" - such as *.tar - whose identity has been lost). There is (probably) no deliberate obfuscation, but there may be some amateur efforts to save space. We can assume that we have a general knowledge of what the data is, and that we may recognize some, but probably not all, of the fields and arrays.

Assume that the majority of the data is numeric: scalars and arrays (probably 1- and 2-dimensional, sometimes irregular or triangular). There will also be some character strings - probably names of people, sites and dates, and maybe some keywords. There will be code in the program that reads the binary file, but we do not have access to the source or the assembler. As examples, the file may have been written by a VAX Fortran program, by an early Unix program, or by Windows as OLE objects. The numbers may be big- or little-endian (not known at the start), but the choice is probably consistent within a file. We may have files written by different versions of the program on different machines (e.g. Cray).
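
A crude way to settle the endianness question is to decode a run of bytes both ways and see which interpretation yields plausible magnitudes. A minimal Python sketch, assuming IEEE 754 4-byte floats (VAX F-floating would need its own decoder) and an invented filename:

    import struct

    def plausible(x):
        # Heuristic: genuine measurements are rarely NaN, denormal or astronomically large.
        return x == x and (x == 0.0 or 1e-30 < abs(x) < 1e30)

    def score_floats(buf, endian):
        # Decode buf as a run of 4-byte floats ('<' = little-endian, '>' = big-endian)
        # and count how many look plausible.
        n = len(buf) // 4
        values = struct.unpack(f"{endian}{n}f", buf[: n * 4])
        return sum(plausible(v) for v in values)

    with open("scan0001.dat", "rb") as f:  # hypothetical instrument file
        chunk = f.read(4096)

    for endian, name in (("<", "little-endian"), (">", "big-endian")):
        print(name, score_floats(chunk, endian), "plausible of", len(chunk) // 4)

The byte order that scores markedly higher is the likely one; the same trick works for 4-byte integers by checking that values fall in a sensible range.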

We can assume we have a reasonably large corpus of files - some hundreds, say.

We can assume two scenarios:

  1. We can rerun the program with different inputs, so we can do experiments (see the sketch after this list).
  2. We cannot rerun the program - we have a fixed set of documents. This has a gentle similarity to decoding historical documents in an unknown language (e.g. Linear B).
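
For scenario 1, the obvious experiment is differential analysis: change a single input, rerun, and compare the two output files byte by byte - the offsets that differ localize the field you changed. A minimal sketch (the filenames are invented):

    def diff_offsets(path_a, path_b):
        # Return the byte offsets at which the two files differ.
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            a, b = fa.read(), fb.read()
        if len(a) != len(b):
            print("lengths differ:", len(a), "vs", len(b))
        return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

    changed = diff_offsets("run_temp20.dat", "run_temp21.dat")
    print(len(changed), "differing bytes, first few at:", changed[:10])

If the differing offsets cluster in one short run, that run is probably the field encoding the input you varied; decoding those bytes as integers or floats in both byte orders usually identifies the type.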

A partial solution may be acceptable - i.e. there may be some fields that no living person now understands, but most of the others are interpretable.

I am only interested in Open Source approaches.

UPDATE: There is a related SO question (http://stackoverflow.com/questions/507093/how-to-reverse-engineer-binary-file-formats-for-compatibility-purposes), but the emphasis is somewhat different.

UPDATE: Clever suggestion from @brianegge to address scenario (1): use truss (or possibly strace on Linux) to dump all write() and similar calls made by the program. This should allow at least the collection of the records as they are written to disk.

+1  A: 

If you are on a system which offers truss, simply watch the program's write() system calls and you'll probably get a good idea of the record structure. It's also possible that the program mmaps the file and copies to it directly from memory, but that's less common.

$ truss -t write echo foo
foo
write(1, " f o o", 3)                           = 3
write(1, "\n", 1)                               = 1

It may also make sense to take a look at the program binary itself. On Unix systems, you can use objdump to view the layout of the binary: it will point you to the code and data sections. You can then open the binary in a hex editor and go to the specific offsets. You may be interested in my tips for Solaris binary files.
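
For the data file itself, a quick stand-in for a hex editor is a small dump routine that seeks to an offset and prints hex alongside ASCII; strings and record boundaries tend to jump out. A sketch (the filename and offset are illustrative):

    def hexdump(path, offset, length=64, width=16):
        # Print `length` bytes starting at `offset`: hex on the left, ASCII on the right.
        with open(path, "rb") as f:
            f.seek(offset)
            data = f.read(length)
        for i in range(0, len(data), width):
            row = data[i:i + width]
            hexpart = " ".join(f"{b:02x}" for b in row)
            text = "".join(chr(b) if 32 <= b < 127 else "." for b in row)
            print(f"{offset + i:08x}  {hexpart:<{width * 3}} {text}")

    hexdump("scan0001.dat", 0x100)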

brianegge
Thanks. Just to clarify: it's the output data, not the code, that I'm interested in. However, of course, the `strings` utility and similar may have some use.
peter.murray.rust
A: 

All files have a header. Start from there: see what similarities you have between two files, eliminate the common "signatures" and work with the differences. These differences should mark the number of records, the export date and similar things.

Common parts between the two headers may just be considered general signatures, and I guess you can ignore them.
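
A minimal sketch of this idea in Python, assuming an invented corpus directory and a guessed header length: classify each header byte position as constant across all files (part of the signature) or varying (a candidate field).

    import glob

    HEADER_LEN = 256  # a guess; adjust once the real header size is known

    headers = []
    for path in glob.glob("corpus/*.dat"):  # illustrative corpus location
        with open(path, "rb") as f:
            # assumes every file is at least HEADER_LEN bytes long
            headers.append(f.read(HEADER_LEN))

    constant = [i for i in range(HEADER_LEN)
                if all(h[i] == headers[0][i] for h in headers)]
    varying = [i for i in range(HEADER_LEN) if i not in constant]

    print("constant (signature) bytes:", constant)
    print("varying (field) bytes:     ", varying)

With hundreds of files, the varying positions usually separate into slowly-changing fields (dates, names) and fast-changing ones (record counts, checksums).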

Quamis
Are there any free utilities for simplifying this?
peter.murray.rust
A: 

I was hoping there was a magic utility that could work out patterns, try different endianness, etc. But there doesn't seem to be!

peter.murray.rust