
I am trying to loop through all files and folders and perform an action on every file that has a certain extension. This method works fine, but I would like to make it multithreaded, because when it runs over tens of thousands of files it is really slow, and I imagine multithreading would speed things up. I am just unsure how to use threading in this case.

`doStuff` reads properties (date modified, etc.) from the files and inserts them into a SQLite database. I am starting a transaction before the scan method is called, so that part is already optimized as much as it can be.

Answers that provide the theory on how to do it are just as good as full working code answers.

    private static string[] validTypes = { ".x", ".y", ".z", ".etc" };

    public static void scan(string rootDirectory)
    {
        try
        {
            // Recurse into subdirectories, skipping the recycle bin.
            foreach (string dir in Directory.GetDirectories(rootDirectory))
            {
                if (dir.ToLower().IndexOf("$recycle.bin") == -1)
                    scan(dir);
            }

            // Process only the files with a whitelisted extension.
            foreach (string file in Directory.GetFiles(rootDirectory))
            {
                if (!((IList<string>)validTypes).Contains(Path.GetExtension(file)))
                {
                    continue;
                }

                doStuff(file);
            }
        }
        catch (Exception)
        {
            // Swallows all errors (e.g. access denied) so the scan keeps going.
        }
    }
+1  A: 

What exactly do doStuff and scan do? Unless they're very CPU intensive, I would have thought disk access would be the bottleneck, and that, if anything, making this multithreaded might be slower.

ho1
`doStuff` reads properties (date modified, etc.) from the files and inserts them into a SQLite database. I am starting a transaction before the scan method is called, so that is optimized as much as it can be.
Ramblingwood
@Ramblingwood: You could try starting with just two threads then: one that reads all the file details into memory, and another that uses that info to write to the DB. Then you could measure how much time each spends processing, and make sure you optimize the right thing.
ho1
@ho1 that sounds like a good idea. I'll try that.
Ramblingwood
@Ramblingwood: Take a look at .NET 4.0's blocking queue (http://msdn.microsoft.com/en-us/library/dd267312(VS.100).aspx). The nice thing about a producer/consumer architecture is that you can easily change the number of threads on either end, for tuning.
Steven Sudit
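A minimal sketch of that producer/consumer split, assuming .NET 4.0's BlockingCollection<T> (the blocking queue linked above); ScanProducerConsumer is a hypothetical name, and doStuff and validTypes are the question's members:

    // Sketch: one thread walks the tree, one consumes paths and writes to the DB.
    // Requires System.Collections.Concurrent, System.IO and System.Threading.
    static void ScanProducerConsumer(string rootDirectory)
    {
        var queue = new BlockingCollection<string>(1000);   // bounded, so the producer can't race far ahead

        var producer = new Thread(() =>
        {
            foreach (string file in Directory.EnumerateFiles(rootDirectory, "*", SearchOption.AllDirectories))
            {
                if (((IList<string>)validTypes).Contains(Path.GetExtension(file)))
                    queue.Add(file);
            }
            queue.CompleteAdding();   // signals the consumer that no more items are coming
        });

        var consumer = new Thread(() =>
        {
            // GetConsumingEnumerable blocks until an item arrives or adding completes.
            foreach (string file in queue.GetConsumingEnumerable())
                doStuff(file);
        });

        producer.Start();
        consumer.Start();
        producer.Join();
        consumer.Join();
    }

Because the two ends only share the queue, you can later add more consumer threads for tuning, as Steven notes.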
+3  A: 

Assuming that doStuff is thread-safe, and that you don't need to wait for the entire scan to finish, you can call both doStuff and scan on the ThreadPool, like this:

string path = file;
ThreadPool.QueueUserWorkItem(delegate { doStuff(path); });

You need to make a separate local variable because the anonymous method would otherwise capture the file variable itself and would see changes to it throughout the loop. (In other words, if the ThreadPool only executed the task after the loop had moved on to the next file, it would process the wrong file.)
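In context, the snippet sits inside the file loop like this (a sketch using the question's names):

    foreach (string file in Directory.GetFiles(rootDirectory))
    {
        string path = file;   // fresh variable per iteration: each delegate captures its own copy
        ThreadPool.QueueUserWorkItem(delegate { doStuff(path); });
    }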

However, reading your comment, the main issue here is disk IO, so I suspect that multithreading will not help much.

Note that Directory.GetFiles will perform slowly for directories with large numbers of files, since it needs to allocate an array to hold all of the filenames.
If you're using .Net 4.0, you can make it faster by calling the EnumerateFiles method instead, which uses an iterator to return an IEnumerable<string> that enumerates the directory as you run your loop.
You can also avoid the recursive scan calls with either method by passing the SearchOption parameter, like this:

foreach (string file in Directory.EnumerateFiles(rootDirectory, "*", SearchOption.AllDirectories))

This will recursively scan all subdirectories, so you'll only need a single foreach loop.
Note that this will exacerbate the performance issues with GetFiles, so you may want to avoid this pre-.Net 4.0.

SLaks
This is the right way to go, but you might also want multiple threads doing the searching.
Steven Sudit
@Steven: I already said that. (In the first paragraph)
SLaks
So even for just reading files, threading won't help? I think it's worth a shot trying this, though.
Ramblingwood
@SLaks: Sorry, the way I understood it, you're suggesting that a single thread does the searching, queuing up `doStuff` for each file it finds. If so, I would suggest having multiple search threads. The idea would be to do your own recursion. As for disk I/O and multithreading, see my response to Eric.
Steven Sudit
hmm. The directories shouldn't have more than 20 or 30 files in them, so that shouldn't be a problem. That last foreach you posted seems pretty interesting, and combined with Dan's comment below, I guess I will have to drop multithreading altogether.
Ramblingwood
As I said in my first sentence, `you can call both doStuff and scan on the ThreadPool`
SLaks
I think this is the best way to improve performance, even if it is just a little bit. Thanks also to Steven, who made that very useful comment in response to Eric above.
Ramblingwood
@SLaks: Indeed you did. I was thrown off by your last sentence, though, where you recommended `SearchOption.AllDirectories`, which would remove the ability to parallelize the scan.
Steven Sudit
+1  A: 

Using multithreading on IO operations is generally a bad call*. You may have multiple CPUs or a CPU with multiple cores; but generally, your hard disk cannot read or write to multiple files at the same time. This sort of thing typically needs to be serialized.

That said, it's a good practice to perform this kind of work on a thread that's separate from your UI thread. That way the UI remains responsive while your app is doing its heavy lifting.

*I am assuming your scan and doStuff methods are actually reading and/or writing data on the hard disk. If this is not the case, parallelizing this code might make sense after all.
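As a sketch, moving the work off the UI thread could look like this (rootDirectory stands in for whatever path the app scans):

    // Run the scan on a background thread so the UI stays responsive.
    // (Marshal any UI updates back via Control.Invoke / Dispatcher.Invoke.)
    Thread worker = new Thread(() => scan(rootDirectory));
    worker.IsBackground = true;   // don't keep the process alive on exit
    worker.Start();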

Dan Tao
I am not modifying anything, fortunately; just reading.
Ramblingwood
@Ramblingwood: Are you reading the *contents* of files, or just looking at the properties of paths and/or `DirectoryInfo`/`FileInfo` objects? Reading from the hard disk is also not something most systems can do from multiple threads at once.
Dan Tao
I see. I was hoping to expand in the future to actually reading from the hard disk so I guess I shouldn't multithread this.
Ramblingwood
+1  A: 

On a side note, there's no need to cast validTypes to IList<string>: in .NET 3.5+, LINQ's Contains extension method works on any IEnumerable<T>, which arrays implement.

Secondly, validTypes might be better implemented as a HashSet<string>, giving you O(1) lookups instead of the O(n) scan that Contains does over an array. That said, this probably won't impact performance much in this case, because your application is IO bound, as the other answers point out.
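A sketch of that change (the case-insensitive comparer is an extra suggestion on my part, so ".X" matches ".x"):

    // O(1) extension lookups; HashSet<T> is available from .NET 3.5.
    private static readonly HashSet<string> validTypes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".x", ".y", ".z", ".etc" };

    // usage inside the loop, no cast needed:
    if (!validTypes.Contains(Path.GetExtension(file)))
        continue;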

Winston Smith
A: 

Thanks to everyone who responded. What I ended up going with was:

        foreach (string file in Directory.EnumerateFiles(rootDirectory, "*", SearchOption.AllDirectories))
        {
            if (!((IList<string>)validTypes).Contains(Path.GetExtension(file)))
            {
                continue;
            }
            string path = file;
            ThreadPool.QueueUserWorkItem(delegate { doStuff(path); });
        }

This ran in about 2 minutes compared to the multiple hours that it was taking before. I think most of the lag was in the database, not the file IO.
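One caveat, as a sketch rather than something the thread covered: QueueUserWorkItem returns as soon as the item is queued, so this loop can finish while doStuff calls are still running. If the surrounding transaction must commit only after every file is processed, .NET 4.0's CountdownEvent can wait for the queued work:

    using (var pending = new CountdownEvent(1))   // start at 1 so it can't hit zero early
    {
        foreach (string file in Directory.EnumerateFiles(rootDirectory, "*", SearchOption.AllDirectories))
        {
            if (!((IList<string>)validTypes).Contains(Path.GetExtension(file)))
                continue;

            string path = file;
            pending.AddCount();   // one pending signal per queued file
            ThreadPool.QueueUserWorkItem(delegate
            {
                try { doStuff(path); }
                finally { pending.Signal(); }
            });
        }
        pending.Signal();   // release the initial count
        pending.Wait();     // block until every doStuff has finished
    }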

Thanks so much everyone!

Ramblingwood
I'm glad that worked out for you.
Steven Sudit