Using C#, I need to read a packed binary file created using FORTRAN. The file is stored in an "Unformatted Sequential" format as described here (about half-way down the page in the "Unformatted Sequential Files" section):

http://www.tacc.utexas.edu/services/userguides/intel8/fc/f_ug1/pggfmsp.htm

As the linked page describes, the file is organized into "chunks" of 130 bytes or less, each consisting of up to 128 bytes of data bracketed by two length bytes (inserted by the FORTRAN compiler).

So, I need to find an efficient way to parse the actual file payload away from the compiler-inserted formatting.

Once I've extracted the actual payload from the file, I'll then need to parse it into its various data types. That'll be the next exercise.

My first thought is to slurp the entire file into a byte array using File.ReadAllBytes, then iterate through the bytes, skipping the formatting and copying the actual data to a second byte array.

In the end, that second byte array should contain the actual file contents minus all the formatting, which I'd then need to go back through to get what I need.

As I'm fairly new to C#, I thought there might be a better, more accepted way of tackling this.

Also, in case it's helpful, these files could be fairly large (say 30MB), though most will be much smaller...

+1  A: 

One way to read files like this is record by record (e.g., read the length bytes and then the data chunk, building up a list of records, which are just byte arrays). The collection of records is then passed to further parsing routines.
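A minimal sketch of the record-by-record approach, assuming my reading of the linked page is right: each chunk is a length byte, up to 128 payload bytes, and a repeated length byte; a length byte of 129 marks a full 128-byte chunk that continues in the next one; the file starts with a marker byte of 75 and ends with 130. Those constants and the `FortranRecords` name are assumptions for illustration, so check them against a hex dump of a real file before relying on this.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class FortranRecords
{
    // Assumed values, taken from my reading of the linked format description.
    const byte FileStartMarker = 75;   // assumed first byte of the file
    const byte FileEndMarker = 130;    // assumed last byte of the file
    const byte ContinuedChunk = 129;   // assumed "128 bytes, continued in next chunk"
    const int MaxPayload = 128;

    // Reads every logical record from the stream and returns each as a byte array.
    public static List<byte[]> ReadRecords(Stream input)
    {
        var records = new List<byte[]>();
        var reader = new BinaryReader(input);
        var current = new MemoryStream();

        if (reader.ReadByte() != FileStartMarker)
            throw new InvalidDataException("Unexpected start-of-file marker.");

        while (true)
        {
            byte lead = reader.ReadByte();   // throws EndOfStreamException if the file is truncated
            if (lead == FileEndMarker)
                break;

            bool continued = lead == ContinuedChunk;
            int size = continued ? MaxPayload : lead;
            if (size > MaxPayload)
                throw new InvalidDataException("Length byte out of range.");

            byte[] payload = reader.ReadBytes(size);
            if (payload.Length != size)
                throw new EndOfStreamException("Chunk is shorter than its length byte.");

            // Assumption: the trailing length byte repeats the leading one.
            if (reader.ReadByte() != lead)
                throw new InvalidDataException("Leading and trailing length bytes differ.");

            current.Write(payload, 0, payload.Length);
            if (!continued)
            {
                // The logical record is complete; store it and start a new one.
                records.Add(current.ToArray());
                current = new MemoryStream();
            }
        }
        return records;
    }
}
```

Passing a `FileStream` (for example from `File.OpenRead`) lets this run in a single pass without loading the whole file first; each returned array can then go to the parsing routines.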

Alternatively, if you're on .NET 4.0, there is a new class for memory-mapped files, `MemoryMappedFile`, which would be more efficient yet work similarly to ReadAllBytes.

If you're using ReadAllBytes or MemoryMappedFile it's nice to build an in-memory "index" into the large binary file by parsing all the record lengths first. This is especially useful if you will only need certain records.
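As a rough illustration of such an index, here is a sketch that walks a byte array once and notes where each chunk's payload starts and how long it is, without copying any data. It assumes the same layout as the linked page appears to describe (start marker 75, end marker 130, length byte 129 meaning a continued 128-byte chunk); `BlockEntry` and `FortranIndex` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public struct BlockEntry
{
    public int Offset;            // index of the first payload byte in the file buffer
    public int Length;            // number of payload bytes in this chunk
    public bool ContinuesInNext;  // true if the logical record continues in the next chunk
}

public static class FortranIndex
{
    // Scans only the length bytes and returns one entry per physical chunk.
    public static List<BlockEntry> Build(byte[] file)
    {
        // Assumed marker values: 75 = start of file, 130 = end of file, 129 = continued chunk.
        if (file.Length == 0 || file[0] != 75)
            throw new InvalidDataException("Unexpected start-of-file marker.");

        var index = new List<BlockEntry>();
        int pos = 1; // position of the next leading length byte
        while (pos < file.Length && file[pos] != 130)
        {
            byte lead = file[pos];
            bool continued = lead == 129;
            int length = continued ? 128 : lead;
            int payloadStart = pos + 1;

            // payloadStart + length is where the trailing length byte must sit.
            if (length > 128 || payloadStart + length >= file.Length)
                throw new InvalidDataException("Chunk runs past the end of the buffer.");

            var entry = new BlockEntry();
            entry.Offset = payloadStart;
            entry.Length = length;
            entry.ContinuesInNext = continued;
            index.Add(entry);

            pos = payloadStart + length + 1; // skip the payload and the trailing length byte
        }
        return index;
    }
}
```

With the index in hand, a value can be pulled straight from the original buffer, e.g. `BitConverter.ToInt32(file, index[0].Offset)`, so records you never need are never copied.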

Stephen Cleary
Thanks. Based on your comments, I've written some code that loads my file into a byte array and produces a second, clean byte array (devoid of length markers). I am now attempting to parse that into various scalar values using BitConverter, though it seems a bit ugly as I need to maintain my own pointer into the array as I convert it. Assuming I continue with the byte array, is there a better way to get various scalars from it? Oh, and I'm not using 4.0...
Jeff Godfrey
It's possible to wrap the byte array into a `MemoryStream` and use a `BinaryReader`. The `BinaryReader` remembers its own position so you don't need to.
Stephen Cleary
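For illustration, the `MemoryStream` plus `BinaryReader` combination might look like the sketch below. The field layout (an `int`, then a `float`, then a `double`) is made up for the example, since the thread never says what the payload contains, and `ScalarParsing` is a hypothetical name. Note that `BinaryReader` reads multi-byte values as little-endian.

```csharp
using System;
using System.IO;

public static class ScalarParsing
{
    // Reads three scalars in sequence from an already-cleaned byte array.
    // The int/float/double layout is hypothetical; substitute the real record layout.
    public static void Parse(byte[] clean, out int count, out float scale, out double value)
    {
        // Wrap the existing array (no copy); 'false' makes the stream read-only.
        using (var stream = new MemoryStream(clean, false))
        using (var reader = new BinaryReader(stream))
        {
            // Each call advances the stream position, so no manual offset is needed.
            count = reader.ReadInt32();
            scale = reader.ReadSingle();
            value = reader.ReadDouble();
        }
    }
}
```

Compared with `BitConverter`, the position bookkeeping lives in the stream, so adding or reordering fields only means adding or reordering `Read` calls.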
A: 

Rather than iterate through the bytes, take a look at System.IO.BinaryReader. Open the file as a FileStream, wrap it in a BinaryReader, and you can read primitive types from it directly, with the stream pointer keeping track of your offset into the blob. You might have to account for endianness and custom types yourself, maybe building your own extension methods for BinaryReader on top of its method for reading individual bytes.
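As an example of the kind of extension method I mean, here is a sketch of a big-endian `Int32` reader built on `ReadBytes`. Whether your files are big-endian at all is an assumption; it depends on the machine and compiler options used to write them, and if they are little-endian the built-in `ReadInt32` already does the job. The name `ReadInt32BigEndian` is made up for the example.

```csharp
using System;
using System.IO;

public static class BinaryReaderExtensions
{
    // Reads four bytes and assembles them most-significant byte first.
    public static int ReadInt32BigEndian(this BinaryReader reader)
    {
        byte[] bytes = reader.ReadBytes(4);
        if (bytes.Length != 4)
            throw new EndOfStreamException("Fewer than 4 bytes remain in the stream.");

        return (bytes[0] << 24) | (bytes[1] << 16) | (bytes[2] << 8) | bytes[3];
    }
}
```

It is called like any other reader method, e.g. `int n = reader.ReadInt32BigEndian();`, and the same pattern extends to other widths or to custom types.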

If you do need the data in a byte array, you can still use BinaryReader if you wrap the array in a MemoryStream first.

With files that large, I'd steer clear of File.ReadAllBytes. FileStream should buffer for you, and Stephen's suggestion for using memory-mapped files sounds like a more sophisticated (possibly more efficient) alternative to that, especially if you need to make a second pass for the formatting.

shambulator
Thanks. The problem I see with going after this directly with BinaryReader is that the data is polluted with length markers (as outlined in the URL of the original post). So, I can't simply begin to read my primitives, as the length markers will trip up the stream pointer. Because of that, it would seem cleaner to first scrub the data of the length markers and then process it in a second step. However, that does mean slurping the entire thing into memory first. Do you see an easy way to steer clear of the length markers and use BinaryReader in a single pass?
Jeff Godfrey
Ah, I see. Well, now that you have code for producing unpolluted data, instead of using BitConverter, you could construct a MemoryStream from each array, which takes care of the array pointer problem (MemoryStream has a constructor for wrapping existing arrays, rather than allocating its own). Then wrap the MemoryStream in a BinaryReader.
shambulator
Ah, now that looks promising (MemoryStream wrapped in a BinaryReader). Let me see what I can work out. Thank you.
Jeff Godfrey