views: 965
answers: 8
Hi,

I've written an application in C# that moves jpgs from one set of directories to another set of directories concurrently (one thread per fixed subdirectory). The code looks something like this:

        string destination = "";
        DirectoryInfo dir = new DirectoryInfo("");
        DirectoryInfo[] subDirs = dir.GetDirectories();
        foreach (DirectoryInfo d in subDirs)
        {
            FileInfo[] files = d.GetFiles();
            foreach (FileInfo f in files)
            {
                // MoveTo expects the full destination path, including the file name
                f.MoveTo(Path.Combine(destination, f.Name));
            }
        }

However, the performance of the application is horrendous - tons of page faults/sec. The number of files in each subdirectory can get quite large, so my theory is that the context switching between threads means the process can't keep all the different file arrays in RAM at the same time, and it ends up going to disk nearly every time.

There are two solutions that I can think of. The first is rewriting this in C or C++, and the second is to use multiple processes instead of multithreading.

Edit: The files are named based on a time stamp, and the directory each one is moved to is based on that name. So the destination directory corresponds to the hour the file was created; 3-27-2009/10 for instance.
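
For illustration, here is a minimal sketch of that mapping. It assumes the hour comes from the file's creation time (the real code presumably parses it out of the file name instead), and the method name is made up:

    // Hypothetical sketch: derive the hourly destination folder,
    // e.g. "3-27-2009\10", from a file's creation time.
    static string GetDestinationPath(string destinationRoot, FileInfo file)
    {
        DateTime stamp = file.CreationTime;
        string hourFolder = Path.Combine(
            Path.Combine(destinationRoot, stamp.ToString("M-d-yyyy")), // "3-27-2009"
            stamp.ToString("HH"));                                     // "10"
        Directory.CreateDirectory(hourFolder); // no-op if it already exists
        return Path.Combine(hourFolder, file.Name);
    }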

We are creating a background worker per directory for threading.

Any suggestions?

+17  A: 

Rule of thumb: don't parallelize operations with serial dependencies. In this case your hard drive is the bottleneck, and too many threads are just going to make performance worse.

If you are going to use threads, try to limit the number to the number of resources you have available (cores and hard disks), not to the number of jobs you have pending (directories to copy).
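
A rough sketch of that idea, with made-up names: keep a shared queue of directories and let a small, fixed number of worker threads drain it, instead of starting one thread per directory.

    // Sketch: a few worker threads (roughly one per disk/core) pull directories
    // off a shared queue. MoveFilesIn() is a hypothetical helper that does the moves.
    static void MoveWithLimitedThreads(IEnumerable<string> directories, int workerCount)
    {
        Queue<string> pending = new Queue<string>(directories);
        List<Thread> workers = new List<Thread>();

        for (int i = 0; i < workerCount; i++)
        {
            Thread t = new Thread(delegate()
            {
                while (true)
                {
                    string dir;
                    lock (pending)
                    {
                        if (pending.Count == 0) return; // queue drained, worker exits
                        dir = pending.Dequeue();
                    }
                    MoveFilesIn(dir);
                }
            });
            workers.Add(t);
            t.Start();
        }

        foreach (Thread t in workers) t.Join(); // wait for all workers to finish
    }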

jms
+1 I agree, although instead of limiting the threads manually, let the ThreadPool.QueueUserWorkItem do the work for you.
bendewey
@bendewey, really? The default threadpool will just grow slowly (a thread or two a second, right?). So after a while, you'll end up with more threads than CPUs anyways.
MichaelGG
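
For reference, a minimal sketch of the ThreadPool.QueueUserWorkItem approach mentioned above, reusing subDirs from the question's snippet and a hypothetical MoveFilesIn() helper:

    // Queue one work item per directory; the ThreadPool decides how many run at once.
    using (ManualResetEvent done = new ManualResetEvent(false))
    {
        int remaining = subDirs.Length;
        foreach (DirectoryInfo d in subDirs)
        {
            DirectoryInfo dir = d; // copy the loop variable before capturing it
            ThreadPool.QueueUserWorkItem(delegate(object state)
            {
                MoveFilesIn(dir); // hypothetical helper
                if (Interlocked.Decrement(ref remaining) == 0) done.Set();
            });
        }
        done.WaitOne(); // block until every queued item has completed
    }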
+6  A: 

If you've got a block of work that is dependent on a system bottleneck, in this case disk IO, you would be better off not using multiple threads or processes. All that you will end up doing is generating a lot of extra CPU and memory activity while waiting for the disk. You would probably find the performance of your app improved if you used a single thread to do your moves.

sipwiz
and thrashing the disk
Kalmi
A: 

If GetFiles() is indeed returning a large set of data, you could write an enumerator so that file names are streamed one at a time rather than materialized all at once, as in:

IEnumerable<string> GetFiles();
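
A sketch of that idea (the method and parameter names are made up): with an iterator, only the directory currently being processed has its file list in memory.

    static IEnumerable<string> GetFiles(string root)
    {
        foreach (string dir in Directory.GetDirectories(root))
        {
            foreach (string file in Directory.GetFiles(dir))
            {
                yield return file; // caller pulls one path at a time
            }
        }
    }

The caller can then foreach over GetFiles(sourceRoot) and move each file as it is yielded.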
+7  A: 

Reconsidered answer

I've been rethinking my original answer below. I still suspect that using fewer threads would probably be a good idea, but as you're just moving files, it shouldn't actually be that IO intensive. It's possible that just listing the files is taking a lot of disk work.

However, I doubt that you're really running out of memory for the files. How much memory have you got? How much memory is the process taking up? How many threads are you using, and how many cores do you have? (Using significantly more threads than you have cores is a bad idea, IMO.)

I suggest the following plan of attack:

  • Work out where the bottlenecks actually are. Try fetching the list of files without actually moving them; see how hard the disk is hit, and how long it takes (a rough timing sketch follows this list).
  • Experiment with different numbers of threads, with a queue of directories still to process.
  • Keep an eye on the memory use and garbage collections. The Windows performance counters for the CLR are good for this.
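
A minimal sketch of that first step, just timing the listing without moving anything (sourceRoot is a placeholder):

    // Time only the enumeration, to see whether listing the files is itself expensive.
    Stopwatch sw = Stopwatch.StartNew();
    int count = 0;
    foreach (DirectoryInfo d in new DirectoryInfo(sourceRoot).GetDirectories())
    {
        count += d.GetFiles().Length;
    }
    sw.Stop();
    Console.WriteLine("Listed {0} files in {1} ms", count, sw.ElapsedMilliseconds);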

Original answer

Rewriting in C or C++ wouldn't help. Using multiple processes wouldn't help. What you're doing is akin to giving a single processor a hundred threads - except you're doing it with the disk instead.

It makes sense to parallelise tasks which use IO if there's also a fair amount of computation involved, but if it's already disk bound, asking the disk to work with lots of files at the same time is only going to make things worse.

You may be interested in a benchmark (description and initial results) I've recently been running, testing "encryption" of individual lines of a file. When the level of "encryption" is low (i.e. it's hardly doing any CPU work) the best results are always with a single thread.

Jon Skeet
It varies per site, but the number of threads ranges from 20 to 40 or so. I've seen it take from 20MB to 100MB in memory; the machines usually have 2GB to 4GB. Performance Monitor shows Page faults/sec usually greater than 20; I've seen it up to 10-40k before, but I have no idea how accurate that is.
David Hodgson
If it's only using 100MB then it's certainly not that it can't keep all the file arrays in memory (don't forget that it's not loading the files, just the filenames). 20-40 threads certainly sounds like *way* too many. Try 1-2 instead.
Jon Skeet
Jon, can you clarify your comment about why using significantly more threads than available cores is a bad idea? The laptop I'm on has a single CPU with one core. A sampling from Task Manager shows some processes with 10 or more threads (one with 74!)...
Matt Davis
...But I don't notice any real performance problems with these. Is it really about the number of threads in relation to the number of cores or more about trying to do too many things simultaneously? Thanks.
Matt Davis
In general, you waste lots of time context switching. You should only use more threads than cores if some of them are blocked waiting for something like IO (and you should ideally avoid that, using async strategies). In this case, you've made the *disk* "context switch" a lot, which is similar.
Jon Skeet
+2  A: 

It seems you are moving a directory; surely just renaming/moving the directory would be sufficient. If the source and destination are on the same hard disk, it would be instant.

Also, capturing all the file info for every file is unnecessary; the name of the file would suffice.
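
If a whole directory really could be moved as a unit, a single call would do it (the paths below are placeholders):

    // Directory.Move requires source and destination on the same volume;
    // there it is just a rename, not a copy of every file.
    Directory.Move(@"C:\source\2009-03-27", @"C:\dest\3-27-2009");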

leppie
+1 - I can't stop laughing
Tony Lee
+1  A: 

The performance problem comes from the hard drive; there is no point in redoing everything in C/C++, nor in using multiple processes.

Yassir
+1  A: 

Are you looking at the page-fault count and inferring memory pressure from that? You might well find that the underlying Win32/OS file copy is using mapped files/page faults to do its work, and the faults are not a sign of a problem anyway. Much of Windows' own file handling is done via page faults (e.g. 'loading' executable code) - they're not a bad thing per se.

If you are suffering from memory pressure, then I would surmise that it's more likely to be caused by creating a huge number of threads (which are very expensive), rather than by the file copying.

Don't change anything without profiling, and if you profile and find the time is spent in framework methods which are merely wrappers on Win32 functions (download the framework source and have a look at how those methods work), then don't waste time on C++.

Will Dean
A: 

So, you're moving files, one at a time, from one subfolder to another subfolder? Wouldn't you be causing lots of disk seeks as the drive head moves back and forth? You might get better performance from reading the files into memory (at least in batches if not all at once), writing them to disk, then deleting the originals from disk.
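
A rough sketch of that approach, assuming each file fits comfortably in memory (the method name is made up):

    // Read the source file fully, write it at the destination, then delete the original.
    static void MoveViaMemory(string sourcePath, string destinationPath)
    {
        byte[] data = File.ReadAllBytes(sourcePath);
        File.WriteAllBytes(destinationPath, data);
        File.Delete(sourcePath);
    }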

And if you're doing multiple sets of folders in separate threads, then you're moving the disk head around even more. This is one case where multiple threads isn't doing you a favor (although you might get some benefit if you have a RAID or SAN, etc).

If you were processing the files in some way, then multithreading could help if different CPUs could work on multiple files at once. But you can't get four CPUs to move one disk head to four different locations at once.

KnownIssues