views:

109

answers:

4

I have written a small crawler to scan and re-sort directory structures.

It is based on dirent (which is a small wrapper around FindNextFileA). In my first benchmarks it is surprisingly slow:

around 123473 ms for 4500 files (ThinkPad T60p, local Samsung 320 GB 2.5" HD). 121481 files found in 123473 milliseconds. Is this speed normal?

This is my code:

#include <dirent.h>   // the dirent wrapper around FindNextFileA mentioned above
#include <windows.h>  // FILE_ATTRIBUTE_DIRECTORY
#include <cstdio>
#include <cstring>
#include <string>

int testPrintDir(std::string strDir, std::string strPattern = "*", bool recurse = true) {
  struct dirent *ent;
  DIR *dir = opendir(strDir.c_str());
  int retVal = 0;
  if (dir != NULL) {
    while ((ent = readdir(dir)) != NULL) {
      if (strcmp(ent->d_name, ".") != 0 && strcmp(ent->d_name, "..") != 0) {
        std::string strFullName = strDir + "\\" + std::string(ent->d_name);
        std::string strType = "N/A";
        // the wrapper exposes the WIN32_FIND_DATA of the current entry as 'data'
        bool isDir = (ent->data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) != 0;
        strType = isDir ? "DIR" : "FILE";
        if (!isDir) {
          //printf("%s <%s>\n", strFullName.c_str(), strType.c_str());
          retVal++;
        }
        if (isDir && recurse) {
          retVal += testPrintDir(strFullName, strPattern, recurse);
        }
      }
    }
    closedir(dir);
    return retVal;
  } else {
    /* could not open directory */
    perror("DIR NOT FOUND!");
    return -1;
  }
}
+3  A: 

The very first time you ever do a recursive directory crawl, you should probably enumerate the entire current directory first and queue up any directories you find to visit when you are done. That way, you are likely to take advantage of any immediate read-ahead optimizations NTFS might do.

On subsequent recursive directory crawls, if the metadata for the directories fits in the system cache, it doesn't matter how you do it.

EDIT: clarifying how you should visit directories. It isn't technically a breadth-first search.
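
A rough sketch of that visiting order, reusing the question's dirent wrapper (the function name and the choice of a vector for the pending list are mine, and I assume the same ent->data.dwFileAttributes field the question's wrapper exposes): each directory is completely enumerated, and its handle closed, before any subdirectory is opened.

#include <dirent.h>   // the same dirent wrapper the question uses
#include <windows.h>  // FILE_ATTRIBUTE_DIRECTORY
#include <cstring>
#include <string>
#include <vector>

// Count files while finishing (and closing) each directory before descending.
// Popping from the back gives a depth-first-ish order, not a breadth-first
// search, but only one directory handle is ever open at a time.
int countFilesQueued(const std::string &root) {
    int files = 0;
    std::vector<std::string> pending;
    pending.push_back(root);

    while (!pending.empty()) {
        std::string current = pending.back();
        pending.pop_back();

        DIR *dir = opendir(current.c_str());
        if (dir == NULL)
            continue;                        // unreadable directory: skip it

        struct dirent *ent;
        while ((ent = readdir(dir)) != NULL) {
            if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
                continue;
            if (ent->data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
                pending.push_back(current + "\\" + ent->d_name);  // visit later
            else
                ++files;
        }
        closedir(dir);                       // closed before any subdirectory is opened
    }
    return files;
}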

MSN
This isn't going to make much difference in practice. The FindXFile APIs boil down to the ZwQueryDirectoryFile/NtQueryDirectoryFile API ( http://msdn.microsoft.com/en-us/library/ff567047(VS.85).aspx ), which retrieves the entire directory at once. The FindXFile APIs already use an internal buffer around this function; the purpose of the FindClose API is to destroy that buffer.
Billy ONeal
@Billy, in that case, part two of my answer kicks in :) Although with a sufficiently populated directory, won't that buffer require hitting the disk again? In which case the first part applies too :)
MSN
A: 

You're holding DIR handles open during the recursive dive. Instead, keep a local list of directories that you've encountered, and process them outside that loop, after the closedir().
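
A minimal sketch of that change to the question's function (the name testPrintDir2 is mine): directory names are remembered in a local vector, and recursion happens only after closedir().

#include <dirent.h>
#include <windows.h>
#include <cstring>
#include <string>
#include <vector>

int testPrintDir2(const std::string &strDir, bool recurse = true) {
    DIR *dir = opendir(strDir.c_str());
    if (dir == NULL)
        return -1;

    int retVal = 0;
    std::vector<std::string> subDirs;        // remembered instead of recursing immediately

    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;
        std::string full = strDir + "\\" + ent->d_name;
        if (ent->data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
            subDirs.push_back(full);
        else
            ++retVal;
    }
    closedir(dir);                           // DIR handle released before any recursion

    if (recurse)
        for (size_t i = 0; i < subDirs.size(); ++i)
            retVal += testPrintDir2(subDirs[i], recurse);
    return retVal;
}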

John
This is going to increase memory consumption but it will have little to do with file scan time, which is limited by the hard disk.
Billy ONeal
+3  A: 

There are some circumstances where such a speed is normal, yes. First, using FindFirstFileA instead of FindFirstFileW is going to incur significant overhead for the conversion from UTF-16 to ANSI. Second, if you are going through directories that have not yet been accessed by the operating system, you will incur at least one seek penalty (about 16 ms on most consumer hard drives), limiting your enumeration to well under 100 directory checks per second. This will get worse if the Master File Table on the given drive is badly fragmented.
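
For illustration, a bare-bones wide-character enumeration of a single directory might look roughly like this (a sketch, not the poster's code; the function name is made up). It avoids the per-entry UTF-16 to ANSI conversion entirely:

#include <windows.h>
#include <string>

// Count files in one directory with the wide (UTF-16) Find APIs, so no
// codepage conversion is performed for each entry.
int countFilesW(const std::wstring &dir) {
    WIN32_FIND_DATAW fd;
    std::wstring pattern = dir + L"\\*";

    HANDLE h = FindFirstFileW(pattern.c_str(), &fd);
    if (h == INVALID_HANDLE_VALUE)
        return -1;

    int files = 0;
    do {
        if (!(fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY))
            ++files;                 // "." and ".." are directories, so they are skipped too
    } while (FindNextFileW(h, &fd));

    FindClose(h);
    return files;
}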

Regarding the number of files, it's going to depend more on the number of files per directory than on the total number of files.

Billy ONeal
A 7200 RPM hard disk spins 120 times per second, so a complete rotation takes a little over 8 ms. Since the head also has to move, the seek time is distributed between, say, 2 and 10 ms, with an average of about 6 ms, plus part of a rotation is needed to read the data. This agrees with read times of 8-9 ms quoted for modern mainstream drives. SCSI and SATA support command queueing; by keeping multiple requests in flight, the drive can sort them and average perhaps 20 block reads per rotation instead of 2. Other than overstating the cost of seeks slightly, your explanation is spot on.
Ben Voigt
@Ben Voigt: My value of ~16 ms comes from the Tom's Hardware hard drive charts: http://www.tomshardware.com/charts/2009-3.5-desktop-hard-drive-charts/h2benchw-3.12-Read-Access-Time,1007.html . Theoretically the drives should perform better than this, but in practice they do not. My figure was based on charts from a few years ago, though, when 16 ms was the norm. Modern drives are somewhere in between my 16 and your 8-9 numbers.
Billy ONeal
@Ben Voigt: Note that laptop hard drives, like the one the OP is testing with, are considerably worse: http://www.tomshardware.com/charts/2009-2.5-mobile-hard-drive-charts/h2benchw-3.12-Read-Access-Time,1107.html
Billy ONeal
Yeah, the OP didn't give a drive rotation speed, but you're right that 7200 RPM is a bad assumption for a laptop; 4200 or 5400 is much more common. OTOH, as long as the controller is running in AHCI mode, command queueing ought to be good for an order-of-magnitude improvement if requests are issued overlapped.
Ben Voigt
+2  A: 

The drive is probably the bottleneck, but you can try:

  1. String operations can be optimized - use char arrays instead of std::string.
  2. Building strFullName on every recursive call is not necessary. Use a single, fixed buffer of chars (i.e. a static array inside the function) and modify it in place (see the sketch after this list).
  3. Don't pass strPattern by value!
  4. Don't create strType unless you're debugging.
  5. Others suggested building a list of directories to process before recursing deeper. To build it, I suggest a single static array (as in 2.) or the stack (alloca).
  6. Does the filesystem use Unicode to store file names? If so (NTFS does), using Unicode strings with FindFirstFileW and FindNextFileW may be a little faster.
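
A rough sketch of points 1-3 combined (the function name is made up, the pattern argument is dropped for brevity, and I assume the same ent->data field the question's wrapper exposes). It keeps the question's recursive structure and only changes the string handling: one writable path buffer is extended in place before each recursive call and truncated afterwards, so no std::string is built per entry.

#include <dirent.h>
#include <windows.h>
#include <cstring>

// Caller supplies one large, writable buffer already holding the start directory,
// e.g.:  char path[4096] = "C:\\some\\dir";  int n = countFiles(path);
int countFiles(char *path) {
    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;

    size_t len = strlen(path);
    int files = 0;
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;

        path[len] = '\\';                       // extend the buffer in place: "<path>\<name>"
        strcpy(path + len + 1, ent->d_name);    // no bounds check: sketch only

        if (ent->data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) {
            int sub = countFiles(path);         // recurse with the extended buffer
            if (sub > 0)
                files += sub;
        } else {
            ++files;
        }
        path[len] = '\0';                       // truncate back to the parent directory
    }
    closedir(dir);
    return files;
}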
adf88