I need to read small sequences of data from a 3.7 GB file. The positions I need to read are not adjacent, but I can order the IO so that the file is read from beginning to end.

The file is stored on an iSCSI SAN which should be capable of handling/optimizing queued I/O.

The question is: how can I make a one-shot request for all the data/positions I need in one go? Is it possible? I don't think async I/O is an option because the reads are very small (20-200 bytes).

Currently the code looks like this:

using (var fileStream = new FileStream(dataStorePath, FileMode.Open, FileAccess.Read, FileShare.Read))
{
    for (int i = 0; i < internalIds.Count(); i++)
    {
        fileStream.Position = seekPositions[i].SeekPosition;
        ... = Serializer.DeserializeWithLengthPrefix<...>(fileStream, PrefixStyle.Base128);

    }
    ...
}

I'm looking for ways to improve this I/O because I'm getting somewhat sub-par read performance. All the seek times from moving the head seem to be adding up.

A: 

Make a single background thread as a disk proxy. Send all your read operations to it, and have it sort and merge the reads. If two or more regions are close, then read the full sector containing them and take sub-sections of the data. Return the data asynchronously.
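A rough sketch of the sort-and-coalesce part of that idea (the ReadRequest type, the DiskProxy name and the 64 KB block size are illustrative, not from the answer; the background thread and the asynchronous hand-back are omitted):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class ReadRequest
{
    public long Offset;    // absolute position in the file
    public int Length;     // bytes wanted (20-200 in this case)
    public byte[] Result;  // filled in by the proxy
}

static class DiskProxy
{
    const int BlockSize = 64 * 1024;  // coalescing window; tune for the SAN

    public static void Execute(string path, IEnumerable<ReadRequest> requests)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            // Sort the requests, group the ones that start in the same block,
            // read each block once and hand every requester its slice.
            foreach (var group in requests.OrderBy(r => r.Offset)
                                          .GroupBy(r => r.Offset / BlockSize))
            {
                long blockStart = group.Key * BlockSize;
                long blockEnd = group.Max(r => r.Offset + r.Length);
                var buffer = new byte[blockEnd - blockStart];

                fs.Position = blockStart;
                fs.Read(buffer, 0, buffer.Length);  // ignoring a short read near EOF for brevity

                foreach (var r in group)
                {
                    r.Result = new byte[r.Length];
                    Array.Copy(buffer, r.Offset - blockStart, r.Result, 0, r.Length);
                }
            }
        }
    }
}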

280Z28
The reads are already ordered, and FileStream itself already does this kind of buffering by default, which is why the performance is not entirely terrible. See the following link for confirmation that buffering does indeed happen: http://blogs.msdn.com/brada/archive/2004/04/15/114329.aspx
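If the default buffer ever turns out to be too small for this access pattern, the internal buffer size can also be passed explicitly; a minimal illustration (the 1 MB figure is arbitrary):

// Same stream as in the question, but with an explicit internal buffer size.
using (var fileStream = new FileStream(dataStorePath, FileMode.Open, FileAccess.Read, FileShare.Read, 1024 * 1024))
{
    ...
}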
legenden
+1  A: 

Have you run Process Monitor (from Microsoft Sysinternals) on this?

I'm not sure what the problem is, but I'll take a guess. If you're reading from a SAN, I would think disk accesses result in network requests under the hood. The first read sends a request to seek, reads and buffers data, and then the Serializer constructs the objects. By the time your second request gets sent, the SAN disks have continued to spin, so you have to wait for the data to spin into place.

Have you tried multithreading? I'm curious about the performance if you set up a queue of the file sections you need to process in sequential order, spin up some threads, have them each open the file separately (with FileShare.Read so they can all access it at once), and then let them start grabbing work from the queue; see the sketch below. Output the results into another collection. If the order of the output matters, sort it by the original order in which you queued the work.
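Purely as a sketch of that idea (it assumes .NET 4's ConcurrentQueue and ConcurrentDictionary; on older frameworks a plain Queue behind a lock would do, and the fixed 256-byte read stands in for the length-prefixed deserialization in the question):

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading;

static class ParallelReader
{
    public static IDictionary<int, byte[]> ReadAll(string path, IList<long> positions, int threadCount)
    {
        var work = new ConcurrentQueue<int>(Enumerable.Range(0, positions.Count));
        var results = new ConcurrentDictionary<int, byte[]>();  // keyed by original index

        var threads = Enumerable.Range(0, threadCount).Select(_ => new Thread(() =>
        {
            // Each thread opens its own handle so it can seek independently.
            using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
            {
                int index;
                while (work.TryDequeue(out index))
                {
                    fs.Position = positions[index];
                    var buffer = new byte[256];  // the records here are 20-200 bytes
                    int read = fs.Read(buffer, 0, buffer.Length);
                    results[index] = buffer.Take(read).ToArray();
                }
            }
        })).ToArray();

        foreach (var t in threads) t.Start();
        foreach (var t in threads) t.Join();
        return results;  // re-order by key afterwards if output order matters
    }
}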

--- EDIT ---

Have you tried the ReadFileScatter API? There is a P/Invoke signature for it on pinvoke.net.
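For reference, a declaration along those lines (an illustrative rendering, not a verbatim copy; the marshaling of FILE_SEGMENT_ELEMENT varies between examples). Bear in mind that ReadFileScatter requires the file to be opened with FILE_FLAG_NO_BUFFERING and FILE_FLAG_OVERLAPPED, and that it reads one contiguous, sector-aligned range of the file into multiple page-sized buffers:

using System;
using System.Runtime.InteropServices;
using System.Threading;
using Microsoft.Win32.SafeHandles;

static class NativeMethods
{
    // Each element of aSegmentArray is a FILE_SEGMENT_ELEMENT: the address of a
    // page-aligned buffer, with a zero entry terminating the array.
    [DllImport("kernel32.dll", SetLastError = true)]
    public static extern bool ReadFileScatter(
        SafeFileHandle hFile,
        long[] aSegmentArray,
        uint nNumberOfBytesToRead,
        IntPtr lpReserved,            // reserved, pass IntPtr.Zero
        ref NativeOverlapped lpOverlapped);
}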

Paul Williams
+1 for understanding the question. I believe that's exactly what's happening: by the time the second read needs to be done, the disks have already spun past, which is why I'm looking into a way to do hardware queueing.
legenden
I would have thought that Windows handled hardware queuing for you. You certainly can't get medieval with the hard drive from plain C#; you can only say "go here and read X bytes". I'd experiment with different patterns of access with multiple threads. Maybe it would be faster if two threads read A and B, then C and D; or maybe A and M, then B and N.
Paul Williams
The ReadFileScatter API sounds promising. Added a blurb to my answer.
Paul Williams
A: 

Just for the record:

In POSIX environments you could request multiple areas of a file with a single system call using the readv function. Another option in a POSIX environment would be non-blocking I/O.

dmeister