My application indexes the contents of all hard drives on end users' computers. I am using Directory.GetFiles and Directory.GetDirectories to recursively process the whole folder structure. I index only a few selected file types (up to 10).

The profiler shows that most of the indexing time is spent enumerating files and folders: up to 90 percent of the time, depending on the ratio of files that actually get indexed.

I would like to make the indexing as fast as possible. I have already optimized the indexing itself and processing of the indexed files.

I was thinking of using Win32 API calls directly, but the profiler shows that most of the processing time is already spent in the Win32 API calls that .NET makes.

Is there a (possibly low-level) method accessible from C# that would make the enumeration of files and folders at least somewhat faster?


As requested in the comments, here is my current code (just an outline, with irrelevant parts trimmed):

    private IEnumerable<IndexedEntity> RecurseFolder(string indexedFolder)
    {
        //for a single extension:
        string[] files = Directory.GetFiles(indexedFolder, extensionFilter);
        foreach (string file in files)
        {
            yield return ProcessFile(file);
        }
        foreach (string directory in Directory.GetDirectories(indexedFolder))
        {
            //recursively process all subdirectories
            foreach (var ie in RecurseFolder(directory))
            {
                yield return ie;
            }
        }
    }
A: 

In .NET 4.0 there are built-in, lazily enumerated file-listing methods (Directory.EnumerateFiles and friends); since that release isn't far away, I would try using them. This may matter in particular if any of your folders are massively populated, because GetFiles then requires a large array allocation.
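For reference, a minimal sketch of what the lazy listing could look like once .NET 4.0 is available. `Directory.EnumerateFiles` is the real framework method; the class name `LazyListing` and the helper `FindFiles` are made up for illustration.

```csharp
using System.Collections.Generic;
using System.IO;

public static class LazyListing
{
    // Hypothetical helper (not from the question): returns matching paths
    // lazily, as the file system produces them, instead of building one
    // string[] per folder. Requires .NET 4.0 or later.
    public static IEnumerable<string> FindFiles(string root, string filter)
    {
        return Directory.EnumerateFiles(root, filter, SearchOption.AllDirectories);
    }
}
```

One caveat when scanning whole drives: with SearchOption.AllDirectories, a single folder that cannot be read throws UnauthorizedAccessException and ends the enumeration, so a manual walk that handles each folder separately may still be needed.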

If depth is the issue, I would consider flattening your method to use a local stack/queue and a single iterator block. This shortens the code path for deep folders, because each result no longer has to be re-yielded through one nested iterator per folder level:

    private static IEnumerable<string> WalkFiles(string path, string filter)
    {
        var pending = new Queue<string>();
        pending.Enqueue(path);
        string[] tmp;
        while (pending.Count > 0)
        {
            path = pending.Dequeue();
            tmp = Directory.GetFiles(path, filter);
            for(int i = 0 ; i < tmp.Length ; i++) {
                yield return tmp[i];
            }
            tmp = Directory.GetDirectories(path);
            for (int i = 0; i < tmp.Length; i++) {
                pending.Enqueue(tmp[i]);
            }
        }
    }

Iterate that, calling ProcessFile on each result to create your IndexedEntity objects.

Marc Gravell
One thing to add: watch out for reparse points. Otherwise, you might end up in an infinite recursion. For an example, see here: http://weblogs.asp.net/israelio/archive/2004/06/23/162913.aspx
peterchen
@peterchen - indeed; they're always fun.
Marc Gravell
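A minimal sketch of the reparse-point check discussed in these comments, applied to the same queue-based walk as in the answer. It only uses calls that exist in .NET 2.0 (`File.GetAttributes`, `FileAttributes.ReparsePoint`); the class name `SafeWalker` and the helper `IsReparsePoint` are hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;

public static class SafeWalker
{
    // True when the directory is a reparse point (junction, symbolic link
    // or mount point), i.e. it may lead back into a folder already visited.
    private static bool IsReparsePoint(string directory)
    {
        return (File.GetAttributes(directory) & FileAttributes.ReparsePoint) != 0;
    }

    // Queue-based walk; reparse points are never enqueued, so a junction
    // that points at one of its own ancestors cannot cause an endless loop.
    public static IEnumerable<string> WalkFiles(string root, string filter)
    {
        var pending = new Queue<string>();
        pending.Enqueue(root);
        while (pending.Count > 0)
        {
            string path = pending.Dequeue();
            foreach (string file in Directory.GetFiles(path, filter))
            {
                yield return file;
            }
            foreach (string dir in Directory.GetDirectories(path))
            {
                if (!IsReparsePoint(dir))
                {
                    pending.Enqueue(dir);
                }
            }
        }
    }
}
```

Skipping every reparse point is the simplest policy, but it also skips volumes mounted into a folder; whether that is acceptable depends on what the index has to cover.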
.NET 4.0 is not an option for me, this is a .NET 2.0 application
Marek
They have been available since .NET 2.0: http://msdn.microsoft.com/de-de/library/07wt70x2(VS.80).aspx
peterchen
@peterchen: you have posted a different link - GetFiles has obviously been there for ages :). Marc is referring to the Directory.EnumerateFiles method: http://msdn.microsoft.com/en-us/library/dd383571(VS.100).aspx
Marek
whoops - sorry :)
peterchen
+1  A: 

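If you believe that the .NET implementation is causing the problem, then I suggest that you call the native find functions directly: _findfirst, _findnext etc. from the C runtime, or the Win32 functions FindFirstFile and FindNextFile that they wrap.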

It seems to me that .NET requires a lot of memory here because the listings are copied completely into arrays at each level of the directory tree. So if your directory structure is 10 levels deep, you hold 10 of these file arrays at any given moment, and you pay for an allocation and deallocation of such an array for every directory in the structure.

Using the same recursive technique with _findfirst etc. requires only that a handle to the current position in each directory be kept at every level of recursion.
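A rough sketch of that idea from C#, under a few assumptions: it calls the Win32 functions FindFirstFile/FindNextFile/FindClose through P/Invoke rather than the CRT's _findfirst, it uses an explicit stack instead of recursion, and it matches a single extension by suffix. The class `NativeWalker` and its method are hypothetical, the code is Windows-only, and paths longer than MAX_PATH are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.InteropServices;
using FILETIME = System.Runtime.InteropServices.ComTypes.FILETIME;

public static class NativeWalker
{
    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
    private struct WIN32_FIND_DATA
    {
        public FileAttributes dwFileAttributes;
        public FILETIME ftCreationTime;
        public FILETIME ftLastAccessTime;
        public FILETIME ftLastWriteTime;
        public uint nFileSizeHigh;
        public uint nFileSizeLow;
        public uint dwReserved0;
        public uint dwReserved1;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
        public string cFileName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
        public string cAlternateFileName;
    }

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern IntPtr FindFirstFile(string lpFileName, out WIN32_FIND_DATA data);

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA data);

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool FindClose(IntPtr hFindFile);

    private static readonly IntPtr InvalidHandle = new IntPtr(-1);

    // One native pass per folder yields files and subfolders together,
    // and no per-folder string[] is allocated.
    public static IEnumerable<string> WalkFiles(string root, string extension)
    {
        var pending = new Stack<string>();
        pending.Push(root);
        while (pending.Count > 0)
        {
            string dir = pending.Pop();
            WIN32_FIND_DATA data;
            IntPtr handle = FindFirstFile(Path.Combine(dir, "*"), out data);
            if (handle == InvalidHandle) continue; // e.g. access denied: skip folder
            try
            {
                do
                {
                    string name = data.cFileName;
                    if (name == "." || name == "..") continue;
                    string full = Path.Combine(dir, name);
                    bool isDir = (data.dwFileAttributes & FileAttributes.Directory) != 0;
                    bool isLink = (data.dwFileAttributes & FileAttributes.ReparsePoint) != 0;
                    if (isDir)
                    {
                        if (!isLink) pending.Push(full); // do not follow junctions
                    }
                    else if (name.EndsWith(extension, StringComparison.OrdinalIgnoreCase))
                    {
                        yield return full;
                    }
                } while (FindNextFile(handle, out data));
            }
            finally
            {
                FindClose(handle); // also runs if the caller stops iterating early
            }
        }
    }
}
```

Usage would be along the lines of `foreach (string file in NativeWalker.WalkFiles(@"C:\", ".txt")) { ... }`. Whether this is measurably faster than Directory.GetFiles plus Directory.GetDirectories would need to be confirmed with the profiler on real data.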

Elemental
There is no problem in the .NET implementation, at least not manifesting in my case. I simply want to make this faster.
Marek
I meant that the .NET implementation was slowing the execution, i.e. causing a performance problem.
Elemental