views: 309
answers: 4

Hello everyone :)

I'm trying to speed up directory enumeration in C++, where I'm recursing into subdirectories. I currently have an app which spends 95% of its time in the FindFirstFile/FindNextFile APIs, and it takes several minutes to enumerate all the files on a given volume. I know it's possible to do this faster, because there is an app that does: Everything. It enumerates my entire drive in seconds.

How might I accomplish something like this?

A: 

Don't recurse immediately; save a list of the directories you find and dive into them when you've finished the current one. You want linear access to each directory, to take advantage of locality of reference and any caching the OS is doing.
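
A minimal sketch of that approach, assuming plain Win32 (EnumerateVolume is just an illustrative name): read each directory front to back with FindFirstFile/FindNextFile, queue the subdirectories you encounter, and only descend once the current directory is exhausted.

    #include <windows.h>
    #include <cwchar>
    #include <deque>
    #include <string>

    // Sketch: queue-driven walk instead of immediate recursion. Each
    // directory is read front to back before any subdirectory is opened.
    void EnumerateVolume(const std::wstring& root)
    {
        std::deque<std::wstring> pending;
        pending.push_back(root);

        while (!pending.empty())
        {
            std::wstring dir = pending.front();
            pending.pop_front();

            WIN32_FIND_DATAW fd;
            HANDLE h = FindFirstFileW((dir + L"\\*").c_str(), &fd);
            if (h == INVALID_HANDLE_VALUE)
                continue; // e.g. access denied - skip this directory

            do
            {
                if (wcscmp(fd.cFileName, L".") == 0 ||
                    wcscmp(fd.cFileName, L"..") == 0)
                    continue;
                if (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
                    pending.push_back(dir + L"\\" + fd.cFileName); // visit later
                // else: process the file entry here
            } while (FindNextFileW(h, &fd));

            FindClose(h);
        }
    }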

Mark Ransom
I'm already doing this.
Billy ONeal
+4  A: 

"Everything" builds an index in the background, so queries are against the index not the file system itself.

There are a few improvements to be made - at least over the straight-forward algorrithm:

First, breadth-first search over depth-first search. That is, enumerate and process all files in a single folder before recursing into the subfolders you found. This improves locality - usually a lot.

On Windows 7 / W2K8R2, you can use FindFirstFileEx with FindExInfoBasic; the main speedup comes from omitting the short file name on NTFS file systems where short names are enabled.
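
As a sketch, the FindFirstFileW call in the kind of loop above would become something like the following (FIND_FIRST_EX_LARGE_FETCH, also Windows 7+, additionally asks the system to use larger directory read buffers):

    WIN32_FIND_DATAW fd;
    HANDLE h = FindFirstFileExW((dir + L"\\*").c_str(),
                                FindExInfoBasic,       // don't fetch the 8.3 short name
                                &fd,
                                FindExSearchNameMatch, // no filtering beyond the pattern
                                NULL,                  // lpSearchFilter must be NULL
                                FIND_FIRST_EX_LARGE_FETCH);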

Separate threads help if you enumerate different physical disks (not just drives). For the same disk, they only help if it's an SSD ("zero seek time"), or if you spend significant time processing a file name (compared to the time spent on disk access).
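
A trivial driver for that, reusing the hypothetical EnumerateVolume walker sketched above and assuming the listed roots really are on separate physical disks:

    #include <thread>
    #include <vector>
    #include <string>

    int main()
    {
        std::vector<std::wstring> roots = { L"C:\\", L"D:\\" }; // one per physical disk
        std::vector<std::thread> workers;
        for (const auto& root : roots)
            workers.emplace_back(EnumerateVolume, root); // one walker per disk
        for (auto& t : workers)
            t.join();
    }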


[edit] Wikipedia actually has some comments - basically, Everything skips the file system abstraction layer and accesses NTFS directly. This way, it can batch calls and skip expensive services of the file system - such as checking ACLs.

A good starting point would be the NTFS Technical Reference on MSDN.
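
To give a flavor of what "accessing NTFS directly" can look like, here is a rough sketch of one such technique (not necessarily what Everything does internally): enumerating every MFT file record in bulk via FSCTL_ENUM_USN_DATA. It requires administrator rights and works only on NTFS volumes:

    #include <windows.h>
    #include <winioctl.h>
    #include <cstdio>

    // Sketch: bulk-enumerate all MFT file records on an NTFS volume.
    // Run elevated; opening the volume handle alone needs admin rights.
    int main()
    {
        HANDLE hVol = CreateFileW(L"\\\\.\\C:", GENERIC_READ,
                                  FILE_SHARE_READ | FILE_SHARE_WRITE,
                                  NULL, OPEN_EXISTING, 0, NULL);
        if (hVol == INVALID_HANDLE_VALUE)
            return 1;

        MFT_ENUM_DATA_V0 med = {}; // plain MFT_ENUM_DATA on pre-Win8 SDKs
        med.StartFileReferenceNumber = 0;
        med.LowUsn = 0;
        med.HighUsn = MAXLONGLONG;

        alignas(DWORDLONG) BYTE buffer[64 * 1024];
        DWORD bytes;
        while (DeviceIoControl(hVol, FSCTL_ENUM_USN_DATA, &med, sizeof(med),
                               buffer, sizeof(buffer), &bytes, NULL))
        {
            // The first 8 bytes are the reference number to resume from.
            med.StartFileReferenceNumber = *(DWORDLONG*)buffer;

            // The rest of the buffer is a run of USN_RECORD structures.
            BYTE* p = buffer + sizeof(USN);
            while (p < buffer + bytes)
            {
                USN_RECORD* rec = (USN_RECORD*)p;
                wprintf(L"%.*s\n",
                        (int)(rec->FileNameLength / sizeof(WCHAR)),
                        (WCHAR*)((BYTE*)rec + rec->FileNameOffset));
                p += rec->RecordLength;
            }
        }
        CloseHandle(hVol); // the loop ends with ERROR_HANDLE_EOF when done
        return 0;
    }

Note that each record carries only the leaf name plus a parent file reference number, so building full paths means joining records on those references - which is essentially the index a tool like Everything maintains.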

peterchen
Definitely reading from an index is quicker, but how do they build/update that index to begin with? Are they fudging their claimed times, or are they doing something special under the hood?
slugster
The NTFS change journal helps with keeping an index up to date.
Ben Voigt
@Ben Voigt: Yes, but if I erase the change journal and Everything's cache, start time is still well under a minute. @peterchen: Everything works on Windows XP, so I don't think the new Windows 7 API is the reason for its speed.
Billy ONeal
@BillyONeal: see update
peterchen
A: 

"Everything" accesses directory infomration at a lower level than the Win32 FindFirst/FindNext APIs.

I believe it reads and interprets the NTFS MFT structures directly, and that this is one of the main reasons for its performance. It's also why it requires admin privileges and why "Everything" only indexes local or removable NTFS volumes (not network drives, for example).

There are a couple of other utilities that do similar things.

A little reverse engineering with a debugger on such tools might give you some insight into the techniques they use.

Michael Burr
"A little reverse engineering with a debugger" <-- I'd prefer to not break EULAs if at all possible. Also, I do not know x86 assembler which would make "debugging" difficult.
Billy ONeal
@BillyONeal: I can understand both points. As an aside - you don't necessarily need to delve into assembly language to get information. Tools like Dependency Walker (http://www.dependencywalker.com/) and Process Monitor (http://technet.microsoft.com/en-us/sysinternals/bb896645.aspx) can tell you a lot about the APIs that a program uses without diving all the way into disassembly. Whether using these tools might violate a EULA is something I can't tell you.
Michael Burr
A: 

If you're already doing the best you can to get the maximum speed from the API, the next step is to do low-level disk access and bypass Windows altogether. You might get some guidance from the NTFS drivers for Linux, or perhaps you can use one directly.

Mark Ransom