views:

674

answers:

5

I am writing a directory scanner in .NET.

For each File/Dir I need the following info.

   class Info {
        public bool IsDirectory;
        public string Path;
        public DateTime ModifiedDate;
        public DateTime CreatedDate;
    }

I have this function:

      static List<Info> RecursiveMovieFolderScan(string path){

        var info = new List<Info>();
        var dirInfo = new DirectoryInfo(path);
        foreach (var dir in dirInfo.GetDirectories()) {
            info.Add(new Info() {
                IsDirectory = true,
                CreatedDate = dir.CreationTimeUtc,
                ModifiedDate = dir.LastWriteTimeUtc,
                Path = dir.FullName
            });

            info.AddRange(RecursiveMovieFolderScan(dir.FullName));
        }

        foreach (var file in dirInfo.GetFiles()) {
            info.Add(new Info()
            {
                IsDirectory = false,
                CreatedDate = file.CreationTimeUtc,
                ModifiedDate = file.LastWriteTimeUtc,
                Path = file.FullName
            });
        }

        return info; 
    }

Turns out this implementation is quite slow. Is there any way to speed this up? I'm thinking of hand coding this with FindFirstFileW but would like to avoid that if there is a built in way that is faster.

A: 

try this (i.e. do the initialization first, and then reuse your list and your directoryInfo objects):

  static List<Info> RecursiveMovieFolderScan1() {
      var info = new List<Info>();
      var dirInfo = new DirectoryInfo(path);
      RecursiveMovieFolderScan(dirInfo, info);
      return info;
  } 

  static List<Info> RecursiveMovieFolderScan(DirectoryInfo dirInfo, List<Info> info){

    foreach (var dir in dirInfo.GetDirectories()) {

        info.Add(new Info() {
            IsDirectory = true,
            CreatedDate = dir.CreationTimeUtc,
            ModifiedDate = dir.LastWriteTimeUtc,
            Path = dir.FullName
        });

        RecursiveMovieFolderScan(dir, info);
    }

    foreach (var file in dirInfo.GetFiles()) {
        info.Add(new Info()
        {
            IsDirectory = false,
            CreatedDate = file.CreationTimeUtc,
            ModifiedDate = file.LastWriteTimeUtc,
            Path = file.FullName
        });
    }

    return info; 
}
Jimmy
This makes no real difference in my benchmark - your method take 33156 mine take 33498... the interop one takes 2872 milliseconds .... over 10X faster
Sam Saffron
this was what i first thought, too ... but then i tested and noticed it had almost same performance. =/
Lucas
try it over an smb share
Sam Saffron
+4  A: 

Depending on how much time you're trying to shave off the function, it may be worth your while to call the Win32 API functions directly, since the existing API does a lot of extra processing to check things that you may not be interested in.

If you haven't done so already, and assuming you don't intend to contribute to the Mono project, I would strongly recommend downloading Reflector and having a look at how Microsoft implemented the API calls you're currently using. This will give you an idea of what you need to call and what you can leave out.

You might, for example, opt to create an iterator that yields directory names instead of a function that returns a list, that way you don't end up iterating over the same list of names two or three times through all the various levels of code.

tylerl
What has Mono got to do with this...?
Cyril Gupta
@Cyril, at http://mono-project.com/Contributing you can read about their requirements. They explicitly say that "if you have looked at Microsoft's implementation of .NET or their shared source code, you will not be able to contribute to Mono". They also mention reflector.
Simon Svensson
+8  A: 

This implementation, which needs a bit of tweaking is 5-10X faster.

    static List<Info> RecursiveScan2(string directory) {
        IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);
        WIN32_FIND_DATAW findData;
        IntPtr findHandle = INVALID_HANDLE_VALUE;

        var info = new List<Info>();
        try {
            findHandle = FindFirstFileW(directory + @"\*", out findData);
            if (findHandle != INVALID_HANDLE_VALUE) {

                do {
                    if (findData.cFileName == "." || findData.cFileName == "..") continue;

                    string fullpath = directory + (directory.EndsWith("\\") ? "" : "\\") + findData.cFileName;

                    bool isDir = false;

                    if ((findData.dwFileAttributes & FileAttributes.Directory) != 0) {
                        isDir = true;
                        info.AddRange(RecursiveScan2(fullpath));
                    }

                    info.Add(new Info()
                    {
                        CreatedDate = findData.ftCreationTime.ToDateTime(),
                        ModifiedDate = findData.ftLastWriteTime.ToDateTime(),
                        IsDirectory = isDir,
                        Path = fullpath
                    });
                }
                while (FindNextFile(findHandle, out findData));

            }
        } finally {
            if (findHandle != INVALID_HANDLE_VALUE) FindClose(findHandle);
        }
        return info;
    }

extension method:

 public static class FILETIMEExtensions {
        public static DateTime ToDateTime(this System.Runtime.InteropServices.ComTypes.FILETIME filetime ) {
            long highBits = filetime.dwHighDateTime;
            highBits = highBits << 32;
            return DateTime.FromFileTimeUtc(highBits + (long)filetime.dwLowDateTime);
        }
    }

interop defs are:

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    public static extern IntPtr FindFirstFileW(string lpFileName, out WIN32_FIND_DATAW lpFindFileData);

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode)]
    public static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATAW lpFindFileData);

    [DllImport("kernel32.dll")]
    public static extern bool FindClose(IntPtr hFindFile);

    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
    public struct WIN32_FIND_DATAW {
        public FileAttributes dwFileAttributes;
        internal System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
        internal System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
        internal System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
        public int nFileSizeHigh;
        public int nFileSizeLow;
        public int dwReserved0;
        public int dwReserved1;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
        public string cFileName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
        public string cAlternateFileName;
    }
Sam Saffron
+1  A: 

I'd use or base myself on this multi-threaded library: http://www.codeproject.com/KB/files/FileFind.aspx

Bertvan
+1  A: 

Its pretty shallow, 371 dirs with average of 10 files in each directory. some dirs contain other sub dirs

This is just a comment, but your numbers do appear to be quite high. I ran the below using essentially the same recursive method you are using and my times are far lower despite creating string output.

    public void RecurseTest(DirectoryInfo dirInfo, 
                            StringBuilder sb, 
                            int depth)
    {
        _dirCounter++;
        if (depth > _maxDepth)
            _maxDepth = depth;

        var array = dirInfo.GetFileSystemInfos();
        foreach (var item in array)
        {
            sb.Append(item.FullName);
            if (item is DirectoryInfo)
            {
                sb.Append(" (D)");
                sb.AppendLine();

                RecurseTest(item as DirectoryInfo, sb, depth+1);
            }
            else
            { _fileCounter++; }

            sb.AppendLine();
        }
    }

I ran the above code on a number of different directories. On my machine the 2nd call to scan a directory tree was usually faster due to caching either by the runtime or the file system. Note that this system isn't anything too special, just a 1yr old development workstation.

// cached call
Dirs = 150, files = 420, max depth = 5
Time taken = 53 milliseconds

// cached call
Dirs = 1117, files = 9076, max depth = 11
Time taken = 433 milliseconds

// first call
Dirs = 1052, files = 5903, max depth = 12
Time taken = 11921 milliseconds

// first call
Dirs = 793, files = 10748, max depth = 10
Time taken = 5433 milliseconds (2nd run 363 milliseconds)

Concerned that I wasn't getting the create and modified date, the code was modified to output this as well with the following times.

// now grabbing last update and creation time.
Dirs = 150, files = 420, max depth = 5
Time taken = 103 milliseconds (2nd run 93 milliseconds)

Dirs = 1117, files = 9076, max depth = 11
Time taken = 992 milliseconds (2nd run 984 milliseconds)

Dirs = 793, files = 10748, max depth = 10
Time taken = 1382 milliseconds (2nd run 735 milliseconds)

Dirs = 1052, files = 5903, max depth = 12
Time taken = 936 milliseconds (2nd run 595 milliseconds)

Note: System.Diagnostics.StopWatch class used for timing.

Robert Paulson
yerp all my numbers are from scanning a network share. so its expected to be quite high
Sam Saffron
Yeah, your slower access speeds makes more sense now.
Robert Paulson