views:

265

answers:

5

I have a scenario where I need to search from many binary files (using keys) and combine the results (strings). Until now, I have been doing it in a for loop one file after the other.

foreach (string file in FileSources.Keys)
{
    aggregatedDefinitions.Append(DefinitionLookup(txtSearchWord.Text, file));
}

Since this operation is very slow, I was thinking of using threads, so that I could do IO operations in parallel. Is threading the right way to go. If I use threading, how can I ensure that I get the results in the order I want.

I haven't used Threading until now. It would be very helpful if you could suggest some materials/books that would help me solve my problem.

A: 

If you are using .NET 4.0 you might want to look into Parallel Extensions and the Parallel class. I've written some examples on how to use it in .NET 4.0 with C#.

You might also want to look into Parallel IO in F# ( Read Don Symes WebLog ). The parts you have that need IO Parallized, you might want to write in F#.

Filip Ekberg
+4  A: 

Generally speaking, it's advised to use threading for I/O operations when you're performing one I/O operation on a separate thread from an application's main (typically, GUI) thread. To split many I/O operations off on separate threads in parallel probably isn't going to help you, since the disk can only be accessed by one at a time.

Dan Tao
+2  A: 

I second the opinion of Dan, too and Fredrik and to add to it - an attempt to multithread IO against a single disk can potentially instead of improving performance make things worse.

Access requests from parallel threads can increase disk thrashing which will make data retrieval from the disk slower than it is now

mfeingold
A: 

Check memory mapped files in .Net 4.0 , if you are using C# 3.5 check the pinvoke implementations for the topic , it really speeds up the io operations and general performance of your application. I have an application which calculates md5 on the given folder to find duplicates and uses memory mapped files for file access. If you need the sample sourcecode and pinvoked memory mapping libraries contact me.

http://en.wikipedia.org/wiki/Memory-mapped_file Or Check The Implementation Here http://www.pinvoke.net/default.aspx/kernel32.createfilemapping

It will really speedup your io operations without additional threading overhead.

Kerem Kusmezer
+3  A: 

Taking into account the concerns voiced by others about attempting parallel I/Os against a single disk device, it does look like your processing model can be broken up into parts. You have a Filesources.Keys input list, and the output appears to be simply appending the compute results to aggregatedDefinitions.

Here's how you can break that up for processing on multiple threads and preserve the order of your current results:

First, decide how many threads you are going to use. For compute intensive tasks, there is usually no point in spinning up more threads than you have CPU cores. For I/O bound tasks, you can use more threads than CPU cores since the threads will be spending most of their time waiting for I/O completion.

Let's assume your DefinitionLookup is compute intensive, not I/O intensive, and let's assume you're running on a dual core CPU. In these conditions, two threads would be a good choice.

Next, break the input up into largish chunks, preserving the order of the inputs. For our two thread scenario, send the first half of the FileSources.Keys list to the first thread, and the second half to the second thread.

In each thread, process the inputs as before, but append the output to a local list object, not the final (shared) aggregatedDefinitions list.

After the threads have finished their processing, have the main thread concatenate each thread's list results into the final aggregatedDefinitions list, in the correct order. (Thread 1 that received the first half of the inputs produces list1, and should be appended to the master list before the output of Thread2's results.

Something like this:

    static void Mainthread()
    {
        List<string> input = new List<string>();  // fill with data

        int half = input.Count() / 2;
        ManualResetEvent event1 = new ManualResetEvent(false);
        List<string> results1 = null;

        // give the first half of the input to the first thread
        ThreadPool.QueueUserWorkItem(r => ComputeTask(input.GetRange(0, half), out results1, event1));

        ManualResetEvent event2 = new ManualResetEvent(false);
        List<string> results2 = null;

        // second half of input to the second thread
        ThreadPool.QueueUserWorkItem(r => ComputeTask(input.GetRange(half + 1, input.Count() - half), out results2, event2));

        // wait for both tasks to complete
        WaitHandle.WaitAll(new WaitHandle[] {event1, event2});

        // combine the results, preserving order.
        List<string> finalResults = new List<string>();
        finalResults.AddRange(results1);
        finalResults.AddRange(results2);
    }

    static void ComputeTask(List<string> input, out List<string> output, ManualResetEvent signal)
    {
        output = new List<string>();
        foreach (var item in input)
        {
            // do work here
            output.Add(item);
        }

        signal.Set();
    }

Also, even if all the I/O activity is hitting one disk drive, you could get some performance benefit using asynchronous file reads. The idea is you could issue the next file read request as soon as you receive the data from the previous file read request, process the data of the previous read, then wait for the completion of the next file read. This allows you to use the CPU for processing while the disk I/O request is being handled, without explicitly using threads yourself.

Compare these (pseudo) execution timelines to read and process 4 chunks of data. Assume a file read takes about 500 time units to complete, and processing that data takes about 10 time units.

Synchronous file I/O:  
read (500)
process data (10)
read (500)
process data (10)
read (500)
process data (10)
read (500)
process data (10)
Total time: 2040 time units

Async file I/O:
begin async read 1
async read 1 completed (500)
begin async read 2 / proces data 1 (10)
async read 2 completed (500)
begin async read 3 / proces data 2 (10)
async read 3 completed (500)
begin async read 4 / proces data 3 (10)
async read 4 completed (500)
process data 4 (10)
Total time: 2010 time units

The processing of data 1, 2 and 3 happens during the time that the next read request is pending, so compared to the first execution timeline, you get the processing time essentially for free. The processing of the last data chunk adds to the total time because there is no read operation for it to run concurrent with.

The scale of these operations (500 for I/O, 10 for compute) is conservative. Real I/O's tend to be even larger compared to compute time, many orders of magnitude higher than compute. As you can see, when the compute operation is pretty quick you don't get a lot of performance benefit out of all this work.

You can get greater value out of the effort of doing async I/O's if what you're doing in the "free" time is substantial. Cryptography or image processing, for example, would likely be a win but string concatenation probably would not be worth it. Writing data to another file could be worthwhile in the async overlap, but as others have noted the benefits will be diminished if all I/O is on the same physical device.

dthorpe