views:

131

answers:

2

Hi,

In the past I have used ThreadPool.QueueUserWorkItem to spawn multiple threads from a manager class. The manager class subscribes to an event on each spawned worker that is raised when its work completes. The manager class can then handle writing the output to a text file, using a lock to prevent race conditions.

Now I am using Parallel.ForEach to do the work. What is my best method for writing all output to a text file in a thread safe manner?

The basic outline of my implementation:

public class Directory
{
    public string Path;

    public Directory(string path)
    {
        Path = path;
    }

    public void Scan()
    {
        Parallel.ForEach(new DirectoryInfo(Path).GetDirectories(),
                         delegate(DirectoryInfo di)
                         {
                             var d = new Directory(di.FullName);
                             d.Scan();
                             //Output to text file.
                         });

    }
}

Which I get going by:

new Directory(@"c:\blah").Scan();

Any ideas to point me in the right direction would be great. I have a few of my own, but I am looking for best practice. I have read this post, but it does not contain a solution that helps me.

+1  A: 

Use EnumerateDirectories (Fx 4) instead of GetDirectories. GetDirectories builds the complete array before the loop can start, so your current code gets very little parallelism out of the enumeration itself.
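A minimal sketch of that change (not from the original answer; the process callback is a hypothetical stand-in for the //Output to text file step, and System.IO.Directory is fully qualified to avoid clashing with the question's own Directory class):

```csharp
using System;
using System.Threading.Tasks;

static class Scanner
{
    // EnumerateDirectories streams results lazily, so Parallel.ForEach can
    // hand the first subdirectories to worker threads while the OS is still
    // listing the rest; GetDirectories blocks until the whole array is built.
    public static void Scan(string path, Action<string> process)
    {
        Parallel.ForEach(System.IO.Directory.EnumerateDirectories(path), dir =>
        {
            Scan(dir, process);   // recurse into subdirectories
            process(dir);         // stands in for "Output to text file"
        });
    }
}
```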

For the rest it depends on whether you need the output to be in order or not.

If you don't care about the order, you can simply lock the output stream (using a helper object), write, and continue. No need for a complicated event mechanism.

If you want to maintain order, push the output to a queue. Process the queue when the ForEach is complete, or start a separate consumer Task to write it as soon as possible. This is the classic Producer/Consumer pattern.

Please note that by making the processing Parallel it becomes very difficult to maintain the order in which the Directories are written.
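For the don't-care-about-order case, the lock-and-write approach might look like this (a sketch, not Henk's code; the Sync helper object and TextWriter parameter are illustrative):

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

static class UnorderedScanner
{
    private static readonly object Sync = new object();  // the helper object

    public static void Scan(string path, TextWriter output)
    {
        Parallel.ForEach(Directory.EnumerateDirectories(path), dir =>
        {
            Scan(dir, output);

            string line = dir;  // stands in for the real per-directory result

            lock (Sync)         // serialize access to the shared writer
            {
                output.WriteLine(line);
            }
        });
    }
}
```

TextWriter.Synchronized(writer) is a built-in alternative to the explicit lock, although an explicit lock also lets you keep multi-statement writes together.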

Henk Holterman
Why does he need a queue? Can't he just do `AsOrdered()` to get everything back in the original order?
Gabe
@Gabe: I don't see how. Processing the dirs will vary in the time used, and EnumerateDirectories is already ordered (and not parallel).
Henk Holterman
Henk: I'm imagining something like this: `outfile.Write(EnumDirs().AsParallel().Select(f => Scan(f)).AsOrdered())` so that the `Scan` processing runs in parallel but the output stays in enumeration order.
Gabe
@Gabe: Try it, I would like to see a working sample. But AFAIK `AsOrdered()` won't work on (after) `Select()`.
Henk Holterman
Henk: You're right; I put it in the wrong place. `Directory.EnumerateFileSystemEntries(@"C:\").AsParallel().AsOrdered().Select(x => { Thread.Sleep(10); return x; })` returns the same results as `Directory.EnumerateFileSystemEntries(@"C:\").Select(x => { Thread.Sleep(10); return x; })`, only much faster. It also runs as fast as `Directory.EnumerateFileSystemEntries(@"C:\").AsParallel().Select(x => { Thread.Sleep(10); return x; })`, which returns its results in arbitrary order.
Gabe
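Gabe's claim is easy to check with a self-contained snippet that avoids the file system entirely (a sketch; names are illustrative):

```csharp
using System;
using System.Linq;
using System.Threading;

static class OrderedDemo
{
    // AsOrdered() tells PLINQ to buffer and re-sequence results, so the
    // output order matches the source order even though the Select bodies
    // run on multiple threads.
    public static bool OrderPreserved()
    {
        var source = Enumerable.Range(0, 100).ToArray();

        var squares = source.AsParallel().AsOrdered()
                            .Select(x => { Thread.Sleep(1); return x * x; })
                            .ToArray();

        return squares.SequenceEqual(source.Select(x => x * x));
    }

    static void Main()
    {
        Console.WriteLine(OrderPreserved()); // prints True
    }
}
```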
A: 

For starters, I would separate the concern of enumerating the files from the concern of processing them.

Perhaps make your Directory class implement IEnumerable<FileInfo> and lazily enumerate all the files using recursion, EnumerateDirectories and EnumerateFiles. (See http://msdn.microsoft.com/en-us/library/dd997370.aspx).

Now you can deal with the issue of consuming that IEnumerable and processing it without mingling the code to recurse the directories.

Create the output stream. Enumerate the IEnumerable<FileInfo> and fire off a Task for each (see http://msdn.microsoft.com/en-us/library/dd321424.aspx). Within each Task, after reading the file and building the output string, lock() and write to the output stream.
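A sketch of that per-file Task approach, assuming a hypothetical Process helper standing in for the real per-file work:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

static class TaskPerFile
{
    static readonly object Sync = new object();

    // Hypothetical per-file work; stands in for "reading the file and
    // creating the output string".
    public static string Process(FileInfo f)
    {
        return f.Name;
    }

    public static void Run(IEnumerable<FileInfo> files, TextWriter output)
    {
        // One Task per file; each holds the lock only for the brief write.
        var tasks = files.Select(f => Task.Factory.StartNew(() =>
        {
            string line = Process(f);
            lock (Sync) { output.WriteLine(line); }
        })).ToArray();

        Task.WaitAll(tasks);
    }
}
```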

Alternatively, and perhaps somewhat cleaner, start a separate consumer Task that does the writing and use a BlockingCollection to pass data between the producers and the consumers (see http://msdn.microsoft.com/en-us/library/dd267312.aspx).
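The BlockingCollection variant might be sketched like this (illustrative names; the single consumer is the only thread touching the writer, so no lock is needed):

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

static class ProducerConsumerScan
{
    public static void Run(string root, TextWriter output)
    {
        var lines = new BlockingCollection<string>();

        // Consumer: GetConsumingEnumerable blocks until items arrive and
        // ends once CompleteAdding has been called and the queue drains.
        var consumer = Task.Factory.StartNew(() =>
        {
            foreach (var line in lines.GetConsumingEnumerable())
                output.WriteLine(line);
        });

        // Producers: a parallel loop over the lazy, recursive enumeration.
        Parallel.ForEach(
            Directory.EnumerateDirectories(root, "*", SearchOption.AllDirectories),
            dir => lines.Add(dir));   // 'dir' stands in for the processed result

        lines.CompleteAdding();       // tell the consumer no more items are coming
        consumer.Wait();
    }
}
```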

When you create the producer tasks you may want to pass in options to limit the maximum degree of parallelism, because the Task scheduler adds threads based on CPU usage and won't notice that the real bottleneck, the disk, is thrashing.
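Capping parallelism is done through ParallelOptions.MaxDegreeOfParallelism; this sketch verifies the cap with synthetic work rather than real I/O (names are illustrative):

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class LimitedDemo
{
    // Returns the peak number of loop bodies observed running at once when
    // MaxDegreeOfParallelism is capped at 'maxDegree'.
    public static int PeakConcurrency(int maxDegree)
    {
        int current = 0, peak = 0;
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxDegree };

        Parallel.ForEach(Enumerable.Range(0, 50), options, i =>
        {
            int now = Interlocked.Increment(ref current);

            // record the high-water mark with a CAS loop
            int old;
            while (now > (old = peak) &&
                   Interlocked.CompareExchange(ref peak, now, old) != old) { }

            Thread.Sleep(5);                 // simulate I/O-bound work
            Interlocked.Decrement(ref current);
        });

        return peak;
    }

    static void Main()
    {
        Console.WriteLine(PeakConcurrency(2) <= 2); // prints True
    }
}
```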

See also http://reedcopsey.com/2010/03/17/parallelism-in-net-part-14-the-different-forms-of-task/ and all of Reed's other blog entries on TPL.

See also the efforts to link TPL and Rx, e.g. http://blogs.msdn.com/b/pfxteam/archive/2010/04/04/9990349.aspx, which will provide an even cleaner syntax for producing and consuming in a situation like this.

Hightechrider