I am about to begin reading tons of binary files, each with 1000 or more records. New files are added constantly, so I'm writing a Windows service to monitor the directories and process new files as they arrive. The files were created by a C++ program. I've recreated the struct definitions in C# and can read the data fine, but I'm concerned that the way I'm doing it will eventually kill my application.

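using (BinaryReader br = new BinaryReader(File.Open("myfile.bin", FileMode.Open, FileAccess.Read)))
{
    long pos = 0L;
    long length = br.BaseStream.Length;
    int recordSize = Marshal.SizeOf(typeof(CPP_STRUCT_DEF));

    CPP_STRUCT_DEF record;

    while (pos < length)
    {
        byte[] buffer = br.ReadBytes(recordSize);
        if (buffer.Length < recordSize)
            break; // truncated trailing record: not enough bytes left to marshal

        // Pin the buffer so the GC cannot move it while PtrToStructure reads from it.
        GCHandle pin = GCHandle.Alloc(buffer, GCHandleType.Pinned);
        try
        {
            record = (CPP_STRUCT_DEF)Marshal.PtrToStructure(pin.AddrOfPinnedObject(), typeof(CPP_STRUCT_DEF));
        }
        finally
        {
            pin.Free(); // always release the handle, even if marshalling throws
        }

        pos += recordSize;

        /* Do stuff with my record */
    }
}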

I don't think I need to use GCHandle, because I'm not actually communicating with the C++ app; everything is being done from managed code. However, I don't know of an alternative method.

+4  A: 

For your particular application, only one thing will give you the definitive answer: Profile it.

That being said, here are the lessons I've learned while working with large PInvoke solutions. The most effective way to marshal data is to marshal fields that are blittable, meaning the CLR can simply do what amounts to a memcpy to move data between native and managed code. In simple terms, get all of the non-inline arrays and strings out of your structures. If they are present in the native structure, represent them with an IntPtr and marshal the values on demand into managed code.
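As a rough illustration of the blittable/non-blittable distinction, here is a minimal sketch. Both struct layouts and the `BlittableCheck` helper are hypothetical (the real CPP_STRUCT_DEF is not shown in the question). The helper relies on the fact that pinning a boxed value with GCHandle only succeeds when the type is blittable.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical layouts for illustration only; assumes the C++ side used 1-byte packing.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct BlittableRecord
{
    public int Id;          // 4 bytes
    public double Value;    // 8 bytes
    public long Timestamp;  // 8 bytes
}

[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct NonBlittableRecord
{
    public int Id;
    [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 16)]
    public string Name;     // a string field forces a per-field conversion
}

public static class BlittableCheck
{
    // Pinning a boxed value throws ArgumentException for non-blittable types,
    // so a failed pin signals that the marshaller has to convert fields one by one.
    public static bool IsBlittable<T>() where T : struct
    {
        try
        {
            GCHandle handle = GCHandle.Alloc(default(T), GCHandleType.Pinned);
            handle.Free();
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }
}
```

Calling `BlittableCheck.IsBlittable<CPP_STRUCT_DEF>()` once at startup would be one way to find out whether your own definition qualifies for the fast path.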

I haven't ever profiled the difference between using Marshal.PtrToStructure vs. having a native API dereference the value. This is probably something you should invest in should PtrToStructure be revealed as a bottleneck via profiling.
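Before reaching for a full profiler, a crude timing harness can give a first impression. The sketch below is only that: `SampleRecord`, `ViaPtrToStructure` and `TimeMs` are made-up names, the struct is a stand-in for the real record, and Stopwatch timings of this kind are indicative rather than rigorous.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

public static class MarshalTiming
{
    // Hypothetical stand-in for the real record type.
    [StructLayout(LayoutKind.Sequential, Pack = 1)]
    public struct SampleRecord
    {
        public int Id;
        public double Value;
    }

    // Converts one record the same way the question does: pin, PtrToStructure, free.
    public static SampleRecord ViaPtrToStructure(byte[] buffer)
    {
        GCHandle pin = GCHandle.Alloc(buffer, GCHandleType.Pinned);
        try
        {
            return (SampleRecord)Marshal.PtrToStructure(pin.AddrOfPinnedObject(), typeof(SampleRecord));
        }
        finally
        {
            pin.Free();
        }
    }

    // Runs the action once as a warm-up (so JIT time is excluded),
    // then times the given number of iterations and returns elapsed milliseconds.
    public static long TimeMs(Action action, int iterations)
    {
        action();
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }
}
```

For example, `MarshalTiming.TimeMs(() => MarshalTiming.ViaPtrToStructure(buffer), 1000000)` times a million conversions of one in-memory record; timing an alternative conversion with the same harness gives a like-for-like comparison.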

For large hierarchies, marshal on demand rather than pulling the entire structure into managed code at once. I've run into this issue most often when dealing with large tree structures. Marshalling an individual node is very fast if it's blittable, and performance-wise it pays to marshal only what you need at that moment.

JaredPar
+2  A: 

Using Marshal.PtrToStructure is rather slow. I found the following CodeProject article, which compares and benchmarks different ways of reading binary data, very helpful:

Fast Binary File Reading with C#

0xA3
+1  A: 

This may be outside the bounds of your question, but I would be inclined to write a little assembly in Managed C++ that did an fread() or something similarly fast to read in the structs. Once you've got them read in, you can use C# to do everything else you need with them.

glaxaco
+1  A: 

In addition to JaredPar's comprehensive answer: you don't need to use GCHandle; you can use unsafe code instead.

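// Requires an unsafe context (compile with /unsafe) and a blittable CPP_STRUCT_DEF.
fixed(byte *pBuffer = buffer) {
     // Reinterpret the pinned bytes as the struct and copy it out.
     record = *((CPP_STRUCT_DEF *)pBuffer);
}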

The whole purpose of the GCHandle/fixed statement is to pin/fix the particular memory segment, making the memory immovable from the GC's point of view. If the memory were movable, any relocation would render your pointers invalid.

Not sure which way is faster though.
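On newer runtimes there is also a route that needs neither GCHandle nor unsafe code. The sketch below assumes `System.Runtime.InteropServices.MemoryMarshal` and `System.Runtime.CompilerServices.Unsafe` are available (they postdate the framework versions this thread was written for) and that the record type is blittable. `SpanRecordReader` and its methods are illustrative names, not an existing API.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

public static class SpanRecordReader
{
    // Reinterprets the leading bytes of the buffer as a T (no pinning, no unsafe block).
    // Kept outside the iterator so no span is created inside the iterator body.
    static T Decode<T>(byte[] buffer) where T : struct
    {
        return MemoryMarshal.Read<T>(buffer);
    }

    // Lazily reads consecutive fixed-size records from the stream.
    // Assumes T is blittable, so its managed size equals its on-disk size.
    public static IEnumerable<T> ReadAll<T>(Stream stream) where T : struct
    {
        int recordSize = Unsafe.SizeOf<T>();
        byte[] buffer = new byte[recordSize];
        while (true)
        {
            // Stream.Read may return fewer bytes than requested, so loop until
            // a whole record has been read or the stream ends.
            int filled = 0, n;
            while (filled < recordSize && (n = stream.Read(buffer, filled, recordSize - filled)) > 0)
            {
                filled += n;
            }
            if (filled < recordSize)
            {
                yield break; // end of stream, or a truncated trailing record
            }
            yield return Decode<T>(buffer);
        }
    }
}
```

Usage would look like `foreach (CPP_STRUCT_DEF record in SpanRecordReader.ReadAll<CPP_STRUCT_DEF>(stream)) { ... }`. Whether this beats PtrToStructure or the fixed approach for your records is something only profiling will tell.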

arul
Thanks for the suggestion. I'm going to profile like Jared suggested, but I'll also profile using this method.
scottm
A: 

Here's a small class I made a while back while playing with structured files. It was the fastest method I could figure out at the time short of going unsafe (which is what I was trying to replace while maintaining comparable performance).

using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.InteropServices;

namespace PersonalUse.IO {

 public sealed class RecordReader<T> : IDisposable, IEnumerable<T> where T : new() {

  const int DEFAULT_STREAM_BUFFER_SIZE = 1 << 16; // default stream buffer (64k)
  const int DEFAULT_RECORD_BUFFER_SIZE = 100; // default record buffer (100 records)

  readonly long _fileSize; // size of the underlying file
  readonly int _recordSize; // size of the record structure
  byte[] _buffer; // the buffer itself, [record buffer size] * _recordSize
  FileStream _fs;

  T[] _structBuffer;
  GCHandle _h; // handle/pinned pointer to _structBuffer 

  int _recordsInBuffer; // how many records are in the buffer
  int _bufferIndex; // the index of the current record in the buffer
  long _recordPosition; // position of the record in the file

  /// <overloads>Initializes a new instance of the <see cref="RecordReader{T}"/> class.</overloads>
  /// <summary>
  /// Initializes a new instance of the <see cref="RecordReader{T}"/> class.
  /// </summary>
  /// <param name="filename">filename to be read</param>
  public RecordReader(string filename) : this(filename, DEFAULT_STREAM_BUFFER_SIZE, DEFAULT_RECORD_BUFFER_SIZE) { }

  /// <summary>
  /// Initializes a new instance of the <see cref="RecordReader{T}"/> class.
  /// </summary>
  /// <param name="filename">filename to be read</param>
  /// <param name="streamBufferSize">buffer size for the underlying <see cref="FileStream"/>, in bytes.</param>
  public RecordReader(string filename, int streamBufferSize) : this(filename, streamBufferSize, DEFAULT_RECORD_BUFFER_SIZE) { }

  /// <summary>
  /// Initializes a new instance of the <see cref="RecordReader{T}"/> class.
  /// </summary>
  /// <param name="filename">filename to be read</param>
  /// <param name="streamBufferSize">buffer size for the underlying <see cref="FileStream"/>, in bytes.</param>
  /// <param name="recordBufferSize">size of record buffer, in records.</param>
  public RecordReader(string filename, int streamBufferSize, int recordBufferSize) {
   _fileSize = new FileInfo(filename).Length;
   _recordSize = Marshal.SizeOf(typeof(T));
   _buffer = new byte[recordBufferSize * _recordSize];
   _fs = new FileStream(filename, FileMode.Open, FileAccess.Read, FileShare.None, streamBufferSize, FileOptions.SequentialScan);

   _structBuffer = new T[recordBufferSize];
   _h = GCHandle.Alloc(_structBuffer, GCHandleType.Pinned);

   FillBuffer();
  }

  // fill the buffer, reset position
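  void FillBuffer() {
   // Stream.Read may return fewer bytes than requested, so loop until the buffer is full or EOF
   int bytes = 0, n;
   while(bytes < _buffer.Length && (n = _fs.Read(_buffer, bytes, _buffer.Length - bytes)) > 0)
    bytes += n;
   _recordsInBuffer = bytes / _recordSize;
   // copy only the complete records that were actually read
   Marshal.Copy(_buffer, 0, _h.AddrOfPinnedObject(), _recordsInBuffer * _recordSize);
   _bufferIndex = 0;
  }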

  /// <summary>
  /// Read a record
  /// </summary>
  /// <returns>a record of type T</returns>
  public T Read() {
   if(_recordsInBuffer == 0)
    return new T(); //EOF
   if(_bufferIndex < _recordsInBuffer) {
    // update positional info
    _recordPosition++;
    return _structBuffer[_bufferIndex++];
   } else {
    // refill the buffer
    FillBuffer();
    return Read();
   }
  }

  /// <summary>
  /// Advances the record position without reading.
  /// </summary>
  public void Next() {
   if(_recordsInBuffer == 0)
    return; // EOF
   else if(_bufferIndex < _recordsInBuffer) {
    _bufferIndex++;
    _recordPosition++;
   } else {
    FillBuffer();
    Next();
   }
  }

  public long FileSize {
   get { return _fileSize; }
  }

  public long FilePosition {
   get { return _recordSize * _recordPosition; }
  }

  public long RecordSize {
   get { return _recordSize; }
  }

  public long RecordPosition {
   get { return _recordPosition; }
  }

  public bool EOF {
   get { return _recordsInBuffer == 0; }
  }

  public void Close() {
   Dispose(true);
  }

  void Dispose(bool disposing) {
   try {
    if(disposing && _fs != null) {
     _fs.Close();
    }
   } finally {
    if(_fs != null) {
     _fs = null;
     _buffer = null;
     _recordPosition = 0;
     _bufferIndex = 0;
     _recordsInBuffer = 0;
    }
    if(_h.IsAllocated) {
     _h.Free();
     _structBuffer = null;
    }
   }
  }

  #region IDisposable Members

  public void Dispose() {
   Dispose(true);
  }

  #endregion

  #region IEnumerable<T> Members

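  public IEnumerator<T> GetEnumerator() {
   while(true) {
    // refill before testing for EOF; otherwise the final Read() yields a spurious default record
    if(_recordsInBuffer != 0 && _bufferIndex >= _recordsInBuffer)
     FillBuffer();
    if(_recordsInBuffer == 0)
     yield break;
    yield return Read();
   }
  }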

  #endregion

  #region IEnumerable Members

  System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() {
   return GetEnumerator();
  }

  #endregion

 } // end class

} // end namespace

To use:

using(RecordReader<CPP_STRUCT_DEF> reader = new RecordReader<CPP_STRUCT_DEF>(path)) {
 foreach(CPP_STRUCT_DEF record in reader) {
  // do stuff
 }
}

(Pretty new here, hope that wasn't too much to post. I just pasted in the class and didn't chop out the comments or anything to shorten it.)

Sean Newton