We have encountered an unexpected performance issue when traversing directories looking for files using a wildcard pattern.

We have 180 folders, each containing 10,000 files. A command-line search using dir <pattern> /s completes almost instantly (under 0.25 seconds). However, the same search from our application takes between 3 and 4 seconds.

We initially tried using System.IO.DirectoryInfo.GetFiles() with SearchOption.AllDirectories and have now tried the Win32 API calls FindFirstFile() and FindNextFile().

Profiling our code indicates that the vast majority of execution time is spent in these calls.

Our code is based on the following blog post:

http://codebetter.com/blogs/matthew.podwysocki/archive/2008/10/16/functional-net-fighting-friction-in-the-bcl-with-directory-getfiles.aspx

We found this to be slow, so we updated the GetFiles function to take a string search pattern rather than a predicate.

Can anyone shed any light on what might be wrong with our approach?

A: 

You can try with an implementation of FindFirstFile and FindNextFile I once blogged about.

Darin Dimitrov
Our approach is very similar to that, Darin.
Richard Ev
I've tested my solution and it takes 230 milliseconds to enumerate a directory containing > 100K files.
Darin Dimitrov
A further speedup can be achieved with `FindFirstFileEx (... FindExInfoBasic ...)`
MSalters
In our scenario we have 180 folders, each containing around 10,000 files. The split across multiple folders is what appears to kill the performance.
Richard Ev
+2  A: 

A simple test with Process Monitor shows that the cmd.exe dir command and Directory.GetFiles() behave significantly differently. Here is what .NET Directory.GetFiles() does for a single directory:

"CreateFile","d:\somedir","SUCCESS","Desired Access: Read Data/List Directory, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Complete If Oplocked, Open For Backup, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened"
"SetBasicInformationFile","d:\somedir","SUCCESS","CreationTime: 1/1/1601 1:59:59 AM, LastAccessTime: 1/1/1601 1:59:59 AM, LastWriteTime: 1/1/1601 1:59:59 AM, ChangeTime: 1/1/1601 1:59:59 AM, FileAttributes: n/a"
"QueryFileInternalInformationFile","d:\somedir","SUCCESS","IndexNumber: 0x4000000000030"
"FileSystemControl","d:\somedir","END OF FILE","Control: FSCTL_FILE_PREFETCH"
"CloseFile","d:\somedir","SUCCESS",""

On the other hand cmd.exe behaves like this:

"CreateFile","d:\somedir","SUCCESS","Desired Access: Read Data/List Directory, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened"
"QueryDirectory","d:\somedir\*","SUCCESS","Filter: *, 1: ."
"QueryDirectory","d:\somedir","SUCCESS"
"QueryDirectory","d:\somedir","NO MORE FILES",""
"CloseFile","d:\somedir","SUCCESS",""
"CreateFile","d:\somedir","SUCCESS","Desired Access: Read Data/List Directory, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened"
"QueryDirectory","d:\somedir\*","SUCCESS","Filter: *, 1: ."
"QueryDirectory","d:\somedir","SUCCESS"
"QueryDirectory","d:\somedir","NO MORE FILES",""
"CloseFile","d:\somedir","SUCCESS",""

Although cmd.exe seems to be doing twice the work in terms of the number of operations, it doesn't appear to issue the SetBasicInformationFile, QueryFileInternalInformationFile or FileSystemControl operations (in native API terms, NtSetInformationFile, NtQueryInformationFile and NtFsControlFile). It only uses NtQueryDirectoryFile to get the information it wants.

The most suspicious operation is SetBasicInformationFile (NtSetInformationFile), which sets a "LastAccessTime", something cmd.exe doesn't bother doing. As you can see, this requires a "write" operation on file system structures and might be where the actual overhead comes from.

However my research is incomplete:

  • I didn't verify if .NET is really slower than cmd.exe. I just compared their operations.

  • I'm not sure whether the asker took "process startup time" into account when comparing the "dir" command with a standalone executable.

  • Some references say FindFirstFile uses NtQueryDirectoryFile but I didn't verify this with Microsoft resources.

  • Someone needs to go through Process Monitor stack traces to find out which specific Win32 APIs are used and run tests using them instead.
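One way to address the first two open points is to time the search inside an already running process, so that startup cost is excluded. The following is only a minimal sketch of such a harness, written in Python rather than .NET so it can be run anywhere. It is not the asker's code: `find_files`, `time_search` and `build_sample_tree` are hypothetical names, and the sample tree is far smaller than the 180 x 10,000 layout described in the question. It lists each directory once and matches the wildcard against the returned names, instead of making a separate call per file.

```python
import fnmatch
import os
import tempfile
import time


def find_files(root, pattern):
    """Yield paths under root whose file name matches the wildcard pattern."""
    stack = [root]
    while stack:
        current = stack.pop()
        # One directory listing per folder; each entry already carries its
        # type, so no extra per-file call is needed to separate files from
        # subdirectories.
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif fnmatch.fnmatch(entry.name, pattern):
                    yield entry.path


def time_search(root, pattern, repeats=3):
    """Return (match_count, best_seconds) measured inside this process."""
    best = float("inf")
    count = 0
    for _ in range(repeats):
        start = time.perf_counter()
        count = sum(1 for _ in find_files(root, pattern))
        best = min(best, time.perf_counter() - start)
    return count, best


def build_sample_tree(root, folders=5, files_per_folder=20):
    """Create a small, hypothetical test tree (every fourth file is a .log)."""
    for d in range(folders):
        folder = os.path.join(root, f"dir{d:03d}")
        os.mkdir(folder)
        for f in range(files_per_folder):
            ext = "log" if f % 4 == 0 else "txt"
            open(os.path.join(folder, f"file{f:05d}.{ext}"), "w").close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        build_sample_tree(root)
        count, seconds = time_search(root, "*.log")
        print(f"{count} matches, best of 3 runs: {seconds * 1000:.2f} ms")
```

Taking the best of several runs also reduces the effect of a cold file system cache, which may be another reason a first run from an application looks slower than a repeated dir command. To approximate the original scenario, raise `folders` and `files_per_folder`, or point `time_search` at the real directory.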

ssg
Doing the same analysis using our approach that used Win32 API calls shows that the disk operations are almost identical.
Richard Ev
A: 

Try IShellFolder::EnumObjects with SHGetDataFromIDList or IShellFolder::GetAttributesOf. Pros and cons here.

Sheng Jiang 蒋晟