I'm writing a program that needs to search a directory and all its subdirectories for files that have a certain extension. This is going to be used both on a local and a network drive, so performance is a bit of an issue.

Here's the recursive method I'm using now:

    private void GetFileList(string fileSearchPattern, string rootFolderPath, List<FileInfo> files)
    {
        DirectoryInfo di = new DirectoryInfo(rootFolderPath);

        FileInfo[] fiArr = di.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly);
        files.AddRange(fiArr);

        DirectoryInfo[] diArr = di.GetDirectories();

        foreach (DirectoryInfo info in diArr)
        {
            GetFileList(fileSearchPattern, info.FullName, files);
        }
    }

I could set the SearchOption to AllDirectories and not use a recursive method, but in the future I'll want to insert some code to notify the user what folder is currently being scanned.
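
(For example, the recursive method above could take a hypothetical callback for that notification; the extra parameter here is purely illustrative.)

    // Illustrative variant: report each folder as it is scanned via a callback.
    private void GetFileList(string fileSearchPattern, string rootFolderPath,
                             List<FileInfo> files, Action<string> onFolder)
    {
        onFolder(rootFolderPath); // e.g. update a status label or log line

        DirectoryInfo di = new DirectoryInfo(rootFolderPath);
        files.AddRange(di.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly));

        foreach (DirectoryInfo info in di.GetDirectories())
        {
            GetFileList(fileSearchPattern, info.FullName, files, onFolder);
        }
    }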

While I'm creating a list of FileInfo objects now, all I really care about are the paths to the files. I'll have an existing list of files, which I want to compare to the new list of files to see which files were added or deleted. Is there any faster way to generate this list of file paths? Is there anything that I can do to optimize this file search around querying for the files on a shared network drive?
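
For the added/deleted comparison itself, a minimal sketch (assuming both lists hold plain path strings; the helper name is made up, and System.Linq is needed) could be:

    // Illustrative helper: diff a previous snapshot of paths against a new one.
    public static void DiffFileLists(IEnumerable<string> oldPaths, IEnumerable<string> newPaths,
                                     out List<string> added, out List<string> deleted)
    {
        // Ordinal, case-insensitive comparison is assumed for Windows paths.
        var oldSet = new HashSet<string>(oldPaths, StringComparer.OrdinalIgnoreCase);
        var newSet = new HashSet<string>(newPaths, StringComparer.OrdinalIgnoreCase);

        added = newSet.Where(p => !oldSet.Contains(p)).ToList();
        deleted = oldSet.Where(p => !newSet.Contains(p)).ToList();
    }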


Update 1

I tried creating a non-recursive method that does the same thing by first finding all the subdirectories and then iteratively scanning each directory for files. Here's the method:

    public static List<FileInfo> GetFileList(string fileSearchPattern, string rootFolderPath)
    {
        DirectoryInfo rootDir = new DirectoryInfo(rootFolderPath);

        List<DirectoryInfo> dirList = new List<DirectoryInfo>(rootDir.GetDirectories("*", SearchOption.AllDirectories));
        dirList.Add(rootDir);

        List<FileInfo> fileList = new List<FileInfo>();

        foreach (DirectoryInfo dir in dirList)
        {
            fileList.AddRange(dir.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly));
        }

        return fileList; 
    }

Update 2

Alright, so I've run some tests on a local and a remote folder, both of which have a lot of files (~1200). Here are the methods I've run the tests on; the results are below.

  • GetFileListA(): Non-recursive solution from the update above. I think it's equivalent to Jay's solution.
  • GetFileListB(): Recursive method from the original question.
  • GetFileListC(): Gets all the directories with the static Directory.GetDirectories() method, then gets all the file paths with the static Directory.GetFiles() method. Populates and returns a List.
  • GetFileListD(): Marc Gravell's solution using a queue, returning an IEnumerable. I populated a List with the resulting IEnumerable.
  • DirectoryInfo.GetFiles: No additional method created. Instantiated a DirectoryInfo from the root folder path and called GetFiles with SearchOption.AllDirectories.
  • Directory.GetFiles: No additional method created. Called the static GetFiles method of the Directory class with SearchOption.AllDirectories.
Method                       Local Folder       Remote Folder
GetFileListA()               00:00.0781235      05:22.9000502
GetFileListB()               00:00.0624988      03:43.5425829
GetFileListC()               00:00.0624988      05:19.7282361
GetFileListD()               00:00.0468741      03:38.1208120
DirectoryInfo.GetFiles       00:00.0468741      03:45.4644210
Directory.GetFiles           00:00.0312494      03:48.0737459

...so it looks like Marc's is the fastest.

A: 

I'd be inclined to return an IEnumerable<> in this case -- depending on how you are consuming the results, it could be an improvement, plus you reduce your parameter footprint by 1/3 and avoid passing around that List incessantly.

private IEnumerable<FileInfo> GetFileList(string fileSearchPattern, string rootFolderPath)
{
    DirectoryInfo di = new DirectoryInfo(rootFolderPath);

    var fiArr = di.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly);
    foreach (FileInfo fi in fiArr)
    {
        yield return fi;
    }

    var diArr = di.GetDirectories();

    // 'dir' avoids shadowing the outer 'di' variable.
    foreach (DirectoryInfo dir in diArr)
    {
        var nextRound = GetFileList(fileSearchPattern, dir.FullName);
        foreach (FileInfo fi in nextRound)
        {
            yield return fi;
        }
    }
}

Another idea would be to spin off BackgroundWorker objects to troll through directories. You wouldn't want a new thread for every directory, but you might create them on the top level (first pass through GetFileList()), so if you execute on your C:\ drive, with 12 directories, each of those directories will be searched by a different thread, which will then recurse through subdirectories. You'll have one thread going through C:\Windows while another goes through C:\Program Files. There are a lot of variables as to how this is going to affect performance -- you'd have to test it to see.
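
A rough sketch of that idea, assuming .NET 4's Task is available (the same shape works with BackgroundWorker) and reusing the same search calls as above:

public static List<FileInfo> GetFileListTopLevelParallel(string fileSearchPattern, string rootFolderPath)
{
    // Needs System.IO, System.Linq, System.Threading.Tasks, System.Collections.Generic.
    DirectoryInfo root = new DirectoryInfo(rootFolderPath);
    var results = new List<FileInfo>(root.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly));

    // One task per top-level directory; each task walks its own subtree.
    var tasks = root.GetDirectories()
                    .Select(dir => Task.Factory.StartNew(
                        () => dir.GetFiles(fileSearchPattern, SearchOption.AllDirectories)))
                    .ToArray();

    Task.WaitAll(tasks);
    foreach (var task in tasks)
    {
        results.AddRange(task.Result);
    }
    return results;
}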

Jay
A: 

You can use a parallel foreach (.NET 4.0), or you can try the Poor Man's Parallel.ForEach Iterator for .NET 3.5. That can speed up your search.
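
For example, on .NET 4 something along these lines should work (a sketch only; it parallelizes over the subdirectories and collects into a thread-safe ConcurrentBag):

public static List<string> GetFileListParallel(string fileSearchPattern, string rootFolderPath)
{
    // Needs System.IO, System.Linq, System.Threading.Tasks,
    // System.Collections.Generic, System.Collections.Concurrent.
    var results = new ConcurrentBag<string>();

    var dirs = new List<string> { rootFolderPath };
    dirs.AddRange(Directory.GetDirectories(rootFolderPath, "*", SearchOption.AllDirectories));

    Parallel.ForEach(dirs, dir =>
    {
        foreach (string file in Directory.GetFiles(dir, fileSearchPattern))
        {
            results.Add(file);
        }
    });

    return results.ToList();
}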

cornerback84
A: 

Consider splitting the updated method into two iterators:

private static IEnumerable<DirectoryInfo> GetDirs(string rootFolderPath)
{
    DirectoryInfo rootDir = new DirectoryInfo(rootFolderPath);
    yield return rootDir;

    foreach (DirectoryInfo di in rootDir.GetDirectories("*", SearchOption.AllDirectories))
    {
        yield return di;
    }
}

public static IEnumerable<FileInfo> GetFileList(string fileSearchPattern, string rootFolderPath)
{
    var allDirs = GetDirs(rootFolderPath);
    foreach (DirectoryInfo di in allDirs)
    {
        var files = di.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly);
        foreach (FileInfo fi in files)
        {
            yield return fi;
        }
    }
}

Also, further to the network-specific scenario, if you were able to install a small service on that server that you could call into from a client machine, you'd get much closer to your "local folder" results, because the search could execute on the server and just return the results to you. This would be your biggest speed boost in the network folder scenario, but may not be available in your situation. I've been using a file synchronization program that includes this option -- once I installed the service on my server the program became WAY faster at identifying the files that were new, deleted, and out-of-sync.

Jay
+1  A: 

Try this iterator block version that avoids recursion and the Info objects:

public static IEnumerable<string> GetFileList(string fileSearchPattern, string rootFolderPath)
{
    Queue<string> pending = new Queue<string>();
    pending.Enqueue(rootFolderPath);
    string[] tmp;
    while (pending.Count > 0)
    {
        rootFolderPath = pending.Dequeue();
        tmp = Directory.GetFiles(rootFolderPath, fileSearchPattern);
        for (int i = 0; i < tmp.Length; i++)
        {
            yield return tmp[i];
        }
        tmp = Directory.GetDirectories(rootFolderPath);
        for (int i = 0; i < tmp.Length; i++)
        {
            pending.Enqueue(tmp[i]);
        }
    }
}

Note also that 4.0 has built-in iterator block versions (EnumerateFiles, EnumerateFileSystemEntries) that may be faster (more direct access to the file system; fewer arrays)
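
For reference, the 4.0 call being described looks roughly like this; it streams paths as they are found instead of building the whole array up front:

// .NET 4.0+ only; rootFolderPath and fileSearchPattern as in the snippets above.
IEnumerable<string> files = Directory.EnumerateFiles(
    rootFolderPath, fileSearchPattern, SearchOption.AllDirectories);

foreach (string path in files)
{
    Console.WriteLine(path);
}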

Marc Gravell
This yield/return stuff is new to me. If I save the results of your method to an IEnumerable<string> variable and run multiple foreach iterations over the variable, will each foreach run off some cache of all the values, or will each foreach re-run GetFileList()?
Eric Anastas
@Eric - It will re-run it, but in that scenario buffer it, with either `.ToList()` (LINQ), or new `List<string>(GetFileList(...))`
Marc Gravell
+1  A: 

Cool question.

I played around a little and, by leveraging iterator blocks and LINQ, I appear to have improved your revised implementation by about 40%.

I would be interested to have you test it out using your timing methods and on your network to see what the difference looks like.

Here is the meat of it

private static IEnumerable<FileInfo> GetFileList(string searchPattern, string rootFolderPath)
{
    var rootDir = new DirectoryInfo(rootFolderPath);
    var dirList = rootDir.GetDirectories("*", SearchOption.AllDirectories);

    return from directoriesWithFiles in ReturnFiles(dirList, searchPattern).SelectMany(files => files)
           select directoriesWithFiles;
}

private static IEnumerable<FileInfo[]> ReturnFiles(DirectoryInfo[] dirList, string fileSearchPattern)
{
    foreach (DirectoryInfo dir in dirList)
    {
        yield return dir.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly);
    }
}
Foovanadil
Eric, don't forget that when testing, you have to traverse the IEnumerable<FileInfo> collection to get this code to actually execute. You should at least echo the filename to console to compare the two approaches, otherwise you've got an apples and oranges comparison when the iterator method returns so fast but execution is deferred.
Jay
What about just populating a List<FileInfo>? List<FileInfo> myList = new List<FileInfo>(GetFileList(some stuff));
Eric Anastas
Jay is correct, of course you need to traverse the collection to see the performance hit. I purposely omitted how I was echoing to the console to calculate the timing, assuming you may have been doing something specific or calculating timing more precisely. This is what I did (the call to .Count() will cause the traversal): var start = DateTime.Now; IEnumerable<FileInfo> fileList = GetFileList("*.txt", @"C:\Temp"); var end = DateTime.Now; Console.WriteLine(String.Format("Files Found: {0}", fileList.Count())); Console.WriteLine(String.Format("Time: {0}", end.Subtract(start).TotalMilliseconds));
Foovanadil
Ran out of chars to answer your comment, Eric. But yes, putting the resulting IEnumerable into a List<T> will cause a traversal too.
Foovanadil
A: 

The short answer of how to improve the performance of that code is: you can't.

The real performance hit you're experiencing is the actual latency of the disk or network, so no matter which way you flip it, you have to check and iterate through each file item and retrieve directory and file listings. (That is, of course, excluding hardware or driver modifications to reduce or improve disk latency, but a lot of people are already paid a lot of money to solve those problems, so we'll ignore that side of it for now.)

Given the original constraints, there are several solutions already posted that more or less elegantly wrap the iteration process, reduce the number of objects created, etc. (However, since I assume I'm reading from a single hard drive, parallelism will NOT help to more quickly traverse a directory tree, and may even increase that time, since you now have two or more threads fighting for data on different parts of the drive as it attempts to seek back and forth.) However, if we evaluate how the function will be consumed by the end developer, there are some optimizations and generalizations that we can come up with.

First, we can delay execution by returning an IEnumerable; yield return accomplishes this by compiling a state-machine enumerator inside an anonymous class that implements IEnumerable and is returned when the method executes. Most methods in LINQ are written to delay execution until the iteration is performed, so the code in a Select or SelectMany will not be executed until the IEnumerable is iterated through. The end result of delayed execution is only felt if you need to take a subset of the data at a later time; for instance, if you only need the first 10 results, delaying the execution of a query that returns several thousand results means it won't iterate through the entire result set until you need more than ten.
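
For instance, with the queue-based GetFileList from the earlier answer (illustrative only; Take needs System.Linq):

    // Nothing has been enumerated yet; the directory walk has not started.
    IEnumerable<string> query = GetFileList("*.txt", @"C:\Temp");

    // The walk runs only now, and it stops once the first 10 matches have been yielded.
    foreach (string path in query.Take(10))
    {
        Console.WriteLine(path);
    }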

Now, given that you want to do a subfolder search, I can also infer that it may be useful if you can specify that depth; doing so generalizes my problem, but also necessitates a recursive solution. Then, later, when someone decides that the search now needs to go two directories deep because we increased the number of files and decided to add another layer of categorization, you can simply make a slight modification instead of re-writing the function.

In light of all that, here is the solution I came up with that provides a more general solution than some of the others above:

    public static IEnumerable<FileInfo> BetterFileList(string fileSearchPattern, string rootFolderPath)
    {
        return BetterFileList(fileSearchPattern, new DirectoryInfo(rootFolderPath), 1);
    }

    public static IEnumerable<FileInfo> BetterFileList(string fileSearchPattern, DirectoryInfo directory, int depth)
    {
        return depth == 0
            ? directory.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly)
            : directory.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly).Concat(
                directory.GetDirectories().SelectMany(x => BetterFileList(fileSearchPattern, x, depth - 1)));
    }

On a side note, something else that hasn't been mentioned by anyone so far is file permissions and security. Currently, there's no checking, handling, or permissions requests, and the code will throw file permission exceptions if it encounters a directory it doesn't have access to iterate through.
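
One way to handle that, sketched on top of the queue-based approach from earlier (the method name is made up; UnauthorizedAccessException is what the directory calls throw for access-denied folders):

    public static IEnumerable<string> GetFileListSafe(string fileSearchPattern, string rootFolderPath)
    {
        Queue<string> pending = new Queue<string>();
        pending.Enqueue(rootFolderPath);
        while (pending.Count > 0)
        {
            string path = pending.Dequeue();
            string[] files = null;
            try
            {
                // Read the directory inside the try; the yield happens outside it,
                // because C# does not allow yield return inside a try/catch.
                files = Directory.GetFiles(path, fileSearchPattern);
                foreach (string dir in Directory.GetDirectories(path))
                {
                    pending.Enqueue(dir);
                }
            }
            catch (UnauthorizedAccessException)
            {
                // Skip directories the current user cannot read.
            }

            if (files != null)
            {
                foreach (string file in files)
                {
                    yield return file;
                }
            }
        }
    }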

Paul Rohde