216 views, 4 answers

This earlier SO question talks about how to retrieve all files in a directory tree that match one of multiple extensions.

E.g. retrieve all files within C:\ and all subdirectories, matching *.log, *.txt, or *.dat.

The accepted answer was this:

var files = Directory.GetFiles("C:\\path", "*.*", SearchOption.AllDirectories)
            .Where(s => s.EndsWith(".mp3") || s.EndsWith(".jpg"));

This strikes me as quite inefficient. If you search a directory tree containing thousands of files (it uses SearchOption.AllDirectories), every single file in the specified tree is loaded into memory, and only then are the mismatches removed. (Reminds me of the "paging" offered by ASP.NET datagrids.)

Unfortunately the standard System.IO.DirectoryInfo.GetFiles method only accepts one filter at a time.

Is it actually as inefficient as I describe, or is this just my lack of LINQ knowledge?

Secondly, is there a more efficient way to do it, both with and without LINQ, without resorting to multiple calls to GetFiles?
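(For later readers: .NET 4 added Directory.EnumerateFiles, which streams file names lazily instead of materialising them all in one array first. A sketch of the multi-extension filter built on top of it; the class and method names here are my own, not from any library:)

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class MultiExtensionSearch
{
    // True if the file's extension is one of those given; the comparison is
    // case-insensitive, matching Windows file-system behaviour.
    public static bool HasExtension(string path, params string[] extensions) =>
        extensions.Contains(Path.GetExtension(path), StringComparer.OrdinalIgnoreCase);

    // EnumerateFiles (unlike GetFiles) yields names one at a time, so
    // mismatches are discarded as the walk proceeds rather than being
    // buffered in a big array up front.
    public static IEnumerable<string> Find(string root, params string[] extensions) =>
        Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories)
                 .Where(f => HasExtension(f, extensions));
}
```

Usage would be e.g. `MultiExtensionSearch.Find(@"C:\path", ".log", ".txt", ".dat")`.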

+1  A: 

You are right about the memory consumption. However, I think that's a fairly premature optimization. Loading an array of a few thousand strings is no problem at all, for either performance or memory. Reading a directory containing that many files, however, will always be relatively slow, no matter how you store or filter the file names.

Konrad Rudolph
For enormous numbers of files, the ideal would be a GetFiles() that accepts multiple filters and then walks the entire directory tree file by file (it has to anyway), calling back to a supplied method for each match. Good points anyway.
Ash
+1  A: 

What about creating your own directory traversal function and using the C# yield statement?

EDIT: I've made a simple test, I don't know if it's exactly what you need.

class Program
{
    static string PATH = "F:\\users\\llopez\\media\\photos";

    static Func<string, bool> WHERE = s => s.EndsWith(".CR2") || s.EndsWith(".html");

    static void Main(string[] args)
    {
        using (new Profiler())
        {
            var accepted = Directory.GetFiles(PATH, "*.*", SearchOption.AllDirectories)
                .Where(WHERE);

            foreach (string f in accepted) { }
        }

        using (new Profiler())
        {
            var files = traverse(PATH, WHERE);

            foreach (string f in files) { }
        }

        Console.ReadLine();
    }

    // Lazily walks the tree: yields matching files in this directory,
    // then recurses into each subdirectory.
    static IEnumerable<string> traverse(string path, Func<string, bool> filter)
    {
        foreach (string f in Directory.GetFiles(path).Where(filter))
        {
            yield return f;
        }

        foreach (string d in Directory.GetDirectories(path))
        {
            foreach (string f in traverse(d, filter))
            {
                yield return f;
            }
        }
    }
}

class Profiler : IDisposable
{
    private Stopwatch stopwatch;

    public Profiler()
    {
        this.stopwatch = new Stopwatch();
        this.stopwatch.Start();
    }

    public void Dispose()
    {
        stopwatch.Stop();
        Console.WriteLine("Running time: {0}ms", this.stopwatch.ElapsedMilliseconds);
        Console.WriteLine("GC.GetTotalMemory(false): {0}", GC.GetTotalMemory(false));
    }
}

I know you cannot rely too much on GC.GetTotalMemory for memory profiling, but all my test runs showed slightly lower memory consumption (around 100 KB less):

Running time: 605ms
GC.GetTotalMemory(false): 3444684
Running time: 577ms
GC.GetTotalMemory(false): 3293368
Leandro López
I'll look into it.
Ash
I think it might help you to avoid loading all the file names and only retrieve those values when needed.
Leandro López
+1  A: 

The GetFiles method only reads the file names, not the file contents, so while reading all the names may be wasteful, I don't think it is anything to worry about.

The only alternative, as far as I know, would be to make multiple GetFiles calls and add the results to a collection, but that gets clumsy and requires scanning the folder several times, so I suspect it would be slower too.
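For what it's worth, the multiple-call approach described here would look roughly like this (class name mine, for illustration); note that it scans the tree once per pattern:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class MultiPatternSearch
{
    // One full recursive scan per pattern: the clumsiness (and likely
    // slowness) the answer warns about. Distinct() guards against a file
    // matching more than one pattern.
    public static IEnumerable<string> GetFiles(string root, params string[] patterns) =>
        patterns.SelectMany(p => Directory.GetFiles(root, p, SearchOption.AllDirectories))
                .Distinct(StringComparer.OrdinalIgnoreCase);
}
```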

Rune Grimstad
+2  A: 

I had the same problem and found the solution in Matthew Podwysocki's excellent post at codebetter.com.

He implemented a solution using native methods that lets you pass a predicate into his GetFiles implementation. Additionally, he implemented it using yield statements, reducing the memory utilization per file to an absolute minimum.

With his code you can write something like the following:

// Use an ordinal, case-insensitive comparer so ".JPG" matches too.
var allowedExtensions = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".jpg", ".mp3" };

var files = GetFiles(
    "C:\\path", 
    SearchOption.AllDirectories, 
    fn => allowedExtensions.Contains(Path.GetExtension(fn))
);

The files variable then points to an enumerator that yields the matched files (delayed-execution style).
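His actual implementation walks the tree with the native FindFirstFile/FindNextFile APIs via P/Invoke; a purely managed stand-in with the same shape (hypothetical, written here only to illustrate the lazy behaviour) would be:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class LazyFiles
{
    // Managed approximation of the predicate-taking GetFiles described in
    // the post; his version avoids the per-directory arrays by calling the
    // Win32 find APIs directly.
    public static IEnumerable<string> GetFiles(
        string path, SearchOption option, Predicate<string> filter)
    {
        foreach (var f in Directory.GetFiles(path))
            if (filter(f))
                yield return f;

        if (option == SearchOption.AllDirectories)
            foreach (var d in Directory.GetDirectories(path))
                foreach (var f in GetFiles(d, option, filter))
                    yield return f;
    }
}
```

Because execution is deferred, something like `GetFiles(...).First()` stops the directory walk as soon as the first match is found.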

Markus Olsson