I'm interested in learning about parallel programming in C#/.NET (not everything there is to know, but the basics and maybe some good practices), so I've decided to rewrite an old program of mine called ImageSyncer. ImageSyncer is a really simple program: it scans through a folder and finds all files ending with .jpg, then calculates the new location of each file based on the date it was taken (by parsing the EXIF data). After a location has been generated, the program checks for an existing file at that location, and if one exists it compares the last write time of the file to copy with that of the file "in its way". If those are equal, the file is skipped. If not, an MD5 checksum of both files is computed and compared. If the checksums don't match, the file to be copied is given a new location (for instance, if it was to be copied to "C:\test.jpg", it's copied to "C:\test(1).jpg" instead). The results are put into a queue of a struct type that contains two strings: the original file and the location to copy it to. That queue is then iterated over until it is empty, and the files are copied.

In other words there are 4 operations:

1. Scan directory for jpegs  
2. Parse files for EXIF data and generate copy-location  
3. Check for file existence and if needed generate new path  
4. Copy files

So I want to rewrite this program to make it parallel and able to perform several of the operations at the same time, and I was wondering what the best way to achieve that would be. I've come up with two different models, but neither of them might be any good at all. The first is to parallelize each of the 4 steps of the old program, so that step 1 is executed on several threads, and step 2 begins only when all of step 1 has finished. The other one (which I find more interesting because I have no idea how to do it) is to create a sort of producer/consumer model, so that when a thread has finished step 1 for an item, another thread takes over and performs step 2 on that item (or something like that). But as I said, I don't know whether either of these is a good solution. Also, I don't know much about parallel programming at all. I know how to create a thread and have it run a function that takes an object as its only parameter, and I've used the BackgroundWorker class on one occasion, but I'm not that familiar with either of them.

Any input would be appreciated.

+5  A: 

There are a few options:

[But as @John Knoeller pointed out, the example you gave is likely to be sequential I/O bound]

Mitch Wheat
+2  A: 

This is the reference I use for C# threading: http://www.albahari.com/threading/

As a single PDF: http://www.albahari.com/threading/threading.pdf

For your second approach:

I've worked on some producer/consumer multithreaded apps where each task is some code that loops forever. An external "initializer" starts a separate thread for each task and initializes an EventWaitHandle for each task. Each task also has a global queue that is used to produce/consume its input.

In your case, your external program would add each directory to the queue for Task 1 and Set the EventWaitHandle for Task 1. Task 1 would "wake up" from its EventWaitHandle, get the count of directories in its queue, and then, while the count is greater than 0, take a directory from the queue, scan it for .jpg files, add each .jpg location to a second queue, and set the EventWaitHandle for Task 2. Task 2 reads its input, processes it, forwards it to a queue for Task 3...
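A minimal sketch of that hand-off between two tasks, assuming a queue guarded by a lock plus an AutoResetEvent. The names `SignaledQueue`, `HandoffDemo`, and the empty-string end marker are hypothetical and only there for illustration; real tasks would scan directories and parse files instead of passing strings through.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical helper: one task's input queue plus the wait handle that wakes it.
public sealed class SignaledQueue<T>
{
    private readonly Queue<T> _items = new Queue<T>();
    private readonly object _gate = new object();
    private readonly AutoResetEvent _signal = new AutoResetEvent(false);

    // Producer side: add the item under the lock, then wake the consumer.
    public void Enqueue(T item)
    {
        lock (_gate) { _items.Enqueue(item); }
        _signal.Set();
    }

    // Consumer side: block until signaled, then drain everything queued so far.
    // The batch may be empty if an earlier drain already took the items.
    public List<T> WaitAndDrain()
    {
        _signal.WaitOne();
        var batch = new List<T>();
        lock (_gate)
        {
            while (_items.Count > 0) batch.Add(_items.Dequeue());
        }
        return batch;
    }
}

public static class HandoffDemo
{
    // Assumption for this sketch: an empty string tells the consumer to stop.
    private const string EndMarker = "";

    public static List<string> Run(IEnumerable<string> jpgPaths)
    {
        var task2Input = new SignaledQueue<string>();
        var received = new List<string>();

        // "Task 2": wakes up on the signal and processes whatever is queued.
        var consumer = new Thread(() =>
        {
            bool done = false;
            while (!done)
            {
                foreach (var path in task2Input.WaitAndDrain())
                {
                    if (path == EndMarker) { done = true; break; }
                    received.Add(path); // real code would parse the file here
                }
            }
        });
        consumer.Start();

        // Stand-in for "Task 1": forward each found file, then the end marker.
        foreach (var path in jpgPaths) task2Input.Enqueue(path);
        task2Input.Enqueue(EndMarker);

        consumer.Join();
        return received;
    }

    public static void Main()
    {
        Console.WriteLine(Run(new[] { "a.jpg", "b.jpg" }).Count); // prints 2
    }
}
```

Draining the whole queue after each wake-up matters here: several Set calls made before the consumer wakes collapse into a single signal, so taking only one item per signal could leave items stranded.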

It can be a bit of a pain getting all the locking to work right (I basically lock any access to the queue, even something as simple as getting its count). .NET 4.0 adds thread-safe collections in System.Collections.Concurrent (for example ConcurrentQueue<T> and BlockingCollection<T>) that support a producer/consumer queue without explicit locking in your own code.
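As a rough sketch of how the same pipeline could look on .NET 4.0 with BlockingCollection<T>, assuming one task per stage: `PipelineSketch`, `Run`, and the stand-in `GenerateCopyLocation` are hypothetical names, and the "copy" stage only records the target path instead of touching the disk.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class PipelineSketch
{
    // Hypothetical stand-in for EXIF parsing: maps a file name to a target path.
    private static string GenerateCopyLocation(string file)
    {
        return "sorted/" + file;
    }

    public static List<string> Run(IEnumerable<string> files)
    {
        // Bounded queues so a fast stage cannot run arbitrarily far ahead of a slow one.
        var found = new BlockingCollection<string>(100);
        var planned = new BlockingCollection<KeyValuePair<string, string>>(100);
        var copied = new List<string>();

        // Stage 1: "scan" (here it just feeds the given names) and mark the queue complete.
        var scan = Task.Factory.StartNew(() =>
        {
            try { foreach (var f in files) found.Add(f); }
            finally { found.CompleteAdding(); }
        });

        // Stage 2: compute the target location for each file as it arrives.
        var plan = Task.Factory.StartNew(() =>
        {
            try
            {
                foreach (var f in found.GetConsumingEnumerable())
                    planned.Add(new KeyValuePair<string, string>(f, GenerateCopyLocation(f)));
            }
            finally { planned.CompleteAdding(); }
        });

        // Stage 3: "copy" (here it only records the target; real code would call File.Copy).
        var copy = Task.Factory.StartNew(() =>
        {
            foreach (var p in planned.GetConsumingEnumerable())
                copied.Add(p.Value);
        });

        Task.WaitAll(scan, plan, copy);
        return copied;
    }

    public static void Main()
    {
        var result = Run(new[] { "a.jpg", "b.jpg", "c.jpg" });
        Console.WriteLine(string.Join(", ", result));
    }
}
```

GetConsumingEnumerable blocks while the queue is empty and ends once CompleteAdding has been called and the queue is drained, so no wait handles or explicit locks are needed in the stage code.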

chocojosh
A: 

Interesting problem. I came up with two approaches. The first is based on PLinq and the second is based on the Rx Framework.

The first one iterates through the files in parallel. The second one asynchronously pushes the files from the directory to a subscriber.

Here is what it looks like in a much simplified version (the first method does require .Net 4.0 since it uses PLinq):

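    // Requires: using System; using System.IO; using System.Linq; using System.Threading;
    // (plus the Rx assemblies for ToObservable/Subscribe in the second part).
    // GenerateCopyLocation is your own method that maps a file to its target path.
    string directory = "Mydirectory";
    var jpegFiles = System.IO.Directory.EnumerateFiles(directory, "*.jpg");


    // --  PLinq --------------------------------------------
    jpegFiles
    .AsParallel()
    .Select(imageFile => new { OldLocation = imageFile, NewLocation = GenerateCopyLocation(imageFile) })
    // ForAll runs the action on the PLinq worker threads and returns when all items are done.
    .ForAll(fileInfo =>
        {
            // Copy when nothing is at the target, or when source and target timestamps differ.
            // Note: File.Copy throws if the target exists; a full version would pick a new name first.
            if (!File.Exists(fileInfo.NewLocation) ||
                File.GetCreationTime(fileInfo.OldLocation) != File.GetCreationTime(fileInfo.NewLocation))
                File.Copy(fileInfo.OldLocation, fileInfo.NewLocation);
        });

    // -----------------------------------------------------


    //-- Rx Framework ---------------------------------------------
    var resetEvent = new AutoResetEvent(false);
    var doTheWork =
    jpegFiles.ToObservable()
    .Select(imageFile => new { OldLocation = imageFile, NewLocation = GenerateCopyLocation(imageFile) })
    .Subscribe(fileInfo =>
        {
            // Same check as above: compare the source file against the target file.
            if (!File.Exists(fileInfo.NewLocation) ||
                File.GetCreationTime(fileInfo.OldLocation) != File.GetCreationTime(fileInfo.NewLocation))
                File.Copy(fileInfo.OldLocation, fileInfo.NewLocation);
        }, () => resetEvent.Set()); // signal once the sequence has completed

    // Wait for completion, then release the subscription.
    resetEvent.WaitOne();
    doTheWork.Dispose();

    // -----------------------------------------------------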
Johnny Blaze
PLinq requires .net 4.0, isn't that correct?
Alxandr
Yes it does require .Net 4.0
Johnny Blaze
A: 

A good article about parallel computing and its application in .net is here.

HotTester