views: 542
answers: 7

Using C#, I am finding the total size of a directory. The logic is: get the files inside the folder, sum up their sizes, check whether there are subdirectories, and then search those recursively.
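Roughly, the logic looks like this (a simplified sketch, not my exact code; the method name is just for illustration):

    // using System.IO;
    private static long GetDirectorySize(string folder)
    {
        long size = 0;

        // Sum the sizes of the files directly inside this folder.
        foreach (string file in Directory.GetFiles(folder))
        {
            size += new FileInfo(file).Length;
        }

        // Then recurse into each subdirectory.
        foreach (string subDir in Directory.GetDirectories(folder))
        {
            size += GetDirectorySize(subDir);
        }

        return size;
    }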

I tried another way to do this too, using FSO (obj.GetFolder(path).Size). There isn't much difference in time between these two approaches.

Now the problem is, I have tens of thousands of files in a particular folder and it takes at least 2 minutes to find the folder size. Also, if I run the program again, it happens very quickly (5 seconds). I think Windows is caching the file sizes.

Is there any way I can bring down the time taken when I run the program the first time?

+7  A: 

The short answer is no. The way Windows could make the directory size computation faster would be to update the directory size and all parent directory sizes on each file write. However, that would make file writes slower. Since file writes are much more common than reads of directory sizes, this is a reasonable tradeoff.

I am not sure what exact problem is being solved, but if it is file system monitoring, it might be worth checking out: http://msdn.microsoft.com/en-us/library/system.io.filesystemwatcher.aspx

Evan
A: 

I don't think it will change a lot, but it might go a little faster if you use the API functions FindFirstFile and FindNextFile to do it.

I don't think there's any really quick way of doing it, however. For comparison purposes you could try running dir /a /x /s > dirlist.txt and listing the directory in Windows Explorer to see how fast they are, but I think they will be similar to FindFirstFile.

PInvoke has a sample of how to use the API.
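For reference, a rough sketch along those lines (untested; declarations as commonly shown on pinvoke.net, with the attribute constants inlined and no special handling of reparse points):

    // using System; using System.IO; using System.Runtime.InteropServices;
    static class FastDirSize
    {
        [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Auto)]
        private struct WIN32_FIND_DATA
        {
            public uint dwFileAttributes;
            public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
            public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
            public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
            public uint nFileSizeHigh;
            public uint nFileSizeLow;
            public uint dwReserved0;
            public uint dwReserved1;
            [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
            public string cFileName;
            [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
            public string cAlternateFileName;
        }

        [DllImport("kernel32.dll", CharSet = CharSet.Auto)]
        private static extern IntPtr FindFirstFile(string fileName, out WIN32_FIND_DATA data);

        [DllImport("kernel32.dll", CharSet = CharSet.Auto)]
        private static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA data);

        [DllImport("kernel32.dll")]
        private static extern bool FindClose(IntPtr hFindFile);

        public static long DirSize(string dir)
        {
            long size = 0;
            WIN32_FIND_DATA findData;
            IntPtr handle = FindFirstFile(Path.Combine(dir, "*"), out findData);
            if (handle == new IntPtr(-1))               // INVALID_HANDLE_VALUE
                return 0;

            try
            {
                do
                {
                    if ((findData.dwFileAttributes & 0x10) != 0)        // FILE_ATTRIBUTE_DIRECTORY
                    {
                        if (findData.cFileName != "." && findData.cFileName != "..")
                            size += DirSize(Path.Combine(dir, findData.cFileName));
                    }
                    else
                    {
                        // The size is reported as two 32-bit halves.
                        size += ((long)findData.nFileSizeHigh << 32) | findData.nFileSizeLow;
                    }
                } while (FindNextFile(handle, out findData));
            }
            finally
            {
                FindClose(handle);
            }
            return size;
        }
    }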

ho1
+17  A: 

I fiddled with it for a while, trying to parallelize it, and surprisingly it sped up here on my machine (up to 3 times on a quad core). I don't know if that holds in all cases, but give it a try...

.NET 4.0 code (or use 3.5 with the Task Parallel Library):

    // using System.IO; using System.Threading; using System.Threading.Tasks;
    private static long DirSize(string sourceDir, bool recurse)
    {
        long size = 0;
        string[] fileEntries = Directory.GetFiles(sourceDir);

        // Sum the files in this directory.
        foreach (string fileName in fileEntries)
        {
            Interlocked.Add(ref size, (new FileInfo(fileName)).Length);
        }

        if (recurse)
        {
            string[] subdirEntries = Directory.GetDirectories(sourceDir);

            // Process subdirectories in parallel; each thread accumulates a subtotal
            // that is added to the shared total in the localFinally delegate.
            Parallel.For<long>(0, subdirEntries.Length, () => 0, (i, loop, subtotal) =>
            {
                // Skip reparse points (junctions/symlinks) to avoid cycles.
                if ((File.GetAttributes(subdirEntries[i]) & FileAttributes.ReparsePoint) != FileAttributes.ReparsePoint)
                {
                    subtotal += DirSize(subdirEntries[i], true);
                }
                // Always return the running subtotal; returning 0 here would drop
                // whatever this thread had already accumulated.
                return subtotal;
            },
                (x) => Interlocked.Add(ref size, x)
            );
        }
        return size;
    }
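A quick usage example (the path is just a placeholder):

    long total = DirSize(@"C:\SomeBigFolder", true);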
spookycoder
This is a good try. I will check it out
Xinxua
At least it probably optimizes the user-mode operations.
kenny
When I was at the Microsoft Visual Studio 2010 launch event (UK Tech Days) the example used to demonstrate the new Parallel LINQ methods was exactly this: calculating directory size. IIRC we saw at least a 2x speed increase when using PLINQ on his quad core laptop. It's in one of the videos here but I can't remember which one: http://www.microsoft.com/uk/techdays/resources.aspx
Codesleuth
A: 

Performance will suffer using any method when scanning a folder with tens of thousands of files.

  • Using the Windows API FindFirstFile... and FindNextFile... functions provides the fastest access.

  • Due to marshalling overhead, even if you use the Windows API functions, performance will not increase. The framework already wraps these API functions, so there is no sense doing it yourself.

  • How you handle the results for any file access method determines the performance of your application. For instance, even if you use the Windows API functions, updating a list-box is where performance will suffer.

  • You cannot compare the execution speed to Windows Explorer. From my experimentation, I believe Windows Explorer reads directly from the file-allocation-table in many cases.

  • I do know that the fastest access to the file system is the DIR command. You cannot compare performance to this command. It definitely reads directly from the file-allocation-table (probably using BIOS).

  • Yes, the operating-system caches file access.

Suggestions

  • I wonder if BackupRead would help in your case?

  • What if you shell out to DIR, capture its output, and then parse it? (You are not really parsing, because each DIR line is fixed-width, so it is just a matter of calling Substring.) A rough sketch of this idea is shown after this list.

  • What if you shell out to DIR /B > NUL on a background thread and then run your program? While DIR is running, you will benefit from the cached file access.
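For what it's worth, the "parse DIR output" suggestion might look roughly like the sketch below (untested, it depends on the English-locale DIR output format, and the method name is made up, so treat it as an illustration only):

    // Rough sketch: run DIR /S and pull the grand total from its summary line.
    // using System.Diagnostics; using System.Text.RegularExpressions;
    static long DirSizeViaDir(string path)
    {
        var psi = new ProcessStartInfo("cmd.exe", "/c dir /s /-c \"" + path + "\"")
        {
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (Process p = Process.Start(psi))
        {
            string output = p.StandardOutput.ReadToEnd();
            p.WaitForExit();

            // The last "File(s) ... bytes" line is the grand total
            // (/-c suppresses the thousands separators).
            MatchCollection matches = Regex.Matches(output, @"File\(s\)\s+(\d+) bytes");
            return matches.Count > 0
                ? long.Parse(matches[matches.Count - 1].Groups[1].Value)
                : -1;
        }
    }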

AMissico
This is incorrect. DIR does not read from the file allocation table. Neither does Windows Explorer. Both make calls that go through Kernel32 and NTDLL and are handled by the filesystem drivers in kernel mode. I ran the dependency walker (depends.exe) on cmd.exe and determined that the DIR command makes calls to the Kernel32.dll routines FindFirstFileW and FindNextFileW. So shelling out to the DIR command will be slower than just calling these yourself.
Ray Burns
First, it is not possible to use "depends" to determine what API calls the DIR command uses.
AMissico
Second, if you monitor the DIR command using "Process Monitor" you will notice only QueryDirectory operations are performed. If you create a simple console application in .NET that calls `GetFileSystemInfos` and `GetDirectories` you will notice the same operations are performed more often, including numerous `CloseFile` and `CreateFile` operations. These .NET methods call the API routines. Therefore, you can infer the DIR command is not calling these API functions.
AMissico
Third, do what I did. Create a console application using C/C++. This application only calls the API routines and recurses down a folder structure. It does not output any content. Compare its execution time to the same DIR command redirected to NUL or to a file. The DIR command is always significantly faster. All access must go through the filesystem drivers, but DIR, and in some cases Windows Explorer, reads directly from the file allocation table. See Chris Gray's answer.
AMissico
Lastly, if you really want to disprove that DIR reads directly from the "fat", use DEBUG and debug CMD. I chose to write the test application to verify the behavior I was experiencing. It is my opinion that DIR has some kind of a "hook" that allows it to read the file-allocation-table in "blocks". (Most likely it uses the technique in Chris Gray's answer.) There is no other explanation for its ability to read file information from the hard drive so quickly.
AMissico
+1  A: 

With tens of thousands of files, you're not going to win with a head-on assault. You need to try to be a bit more creative with the solution. With that many files, you could probably even find that in the time it takes you to calculate the size, the files have changed and your data is already wrong.

So, you need to move the load to somewhere else. For me, the answer would be to use System.IO.FileSystemWatcher and write some code that monitors the directory and updates an index.

It should take only a short time to write a Windows Service that can be configured to monitor a set of directories and write the results to a shared output file. You can have the service recalculate the file sizes on startup, but then just monitor for changes whenever a Created/Deleted/Changed event is fired by the System.IO.FileSystemWatcher. The benefit of monitoring the directory is that you are only interested in small changes, which means that your figures have a higher chance of being correct (remember, all data is stale!).
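Very roughly, the monitoring-plus-index idea might look like the sketch below (the class name is made up, there is no error handling, and a real service would also have to cope with FileSystemWatcher's internal buffer overflowing under heavy churn):

    // Sketch: keep a per-file size index up to date from FileSystemWatcher events,
    // so the directory total is always available without rescanning.
    // using System.Collections.Concurrent; using System.IO; using System.Linq;
    class DirectorySizeIndex
    {
        private readonly ConcurrentDictionary<string, long> _sizes =
            new ConcurrentDictionary<string, long>();
        private readonly FileSystemWatcher _watcher;

        public DirectorySizeIndex(string root)
        {
            // Full scan once, on startup (this is the expensive part).
            foreach (string file in Directory.GetFiles(root, "*", SearchOption.AllDirectories))
                _sizes[file] = new FileInfo(file).Length;

            _watcher = new FileSystemWatcher(root) { IncludeSubdirectories = true };
            _watcher.Created += (s, e) => Update(e.FullPath);
            _watcher.Changed += (s, e) => Update(e.FullPath);
            _watcher.Deleted += (s, e) => { long removed; _sizes.TryRemove(e.FullPath, out removed); };
            _watcher.Renamed += (s, e) =>
            {
                long removed;
                _sizes.TryRemove(e.OldFullPath, out removed);
                Update(e.FullPath);
            };
            _watcher.EnableRaisingEvents = true;
        }

        private void Update(string path)
        {
            // Directories raise events too; only index files.
            if (File.Exists(path))
                _sizes[path] = new FileInfo(path).Length;
        }

        // Current total for the tree; slightly stale by design.
        public long TotalSize { get { return _sizes.Values.Sum(); } }
    }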

Then, the only thing to look out for is that you would have multiple processes trying to access the resulting output file at the same time. So just make sure that you take that into account.

Chris Kemp
Please don't do this; you'll end up hogging resources from all the other apps. Not to mention this trick is very fragile.
stuck
+3  A: 

Hard disks are an interesting beast - sequential access (reading a big contiguous file, for example) is super zippy, figure 80 megabytes/sec. Random access, however, is very slow. This is what you're bumping into - recursing into the folders won't read much data (in terms of quantity), but will require many random reads. The reason you're seeing zippy performance the second time around is because the MFT is still in RAM (you're correct about the caching).

The best mechanism I've seen for this is to scan the MFT yourself. The idea is that you read and parse the MFT in one linear pass, building the information you need as you go. The end result will be something much closer to 15 seconds on an HD that is very full.

Some good reading:
NTFSInfo.exe - http://technet.microsoft.com/en-us/sysinternals/bb897424.aspx
Windows Internals - http://www.amazon.com/Windows%C2%AE-Internals-Including-Windows-PRO-Developer/dp/0735625301/ref=sr_1_1?ie=UTF8&s=books&qid=1277085832&sr=8-1

FWIW: this method is very complicated, as there really isn't a great way to do this in Windows (or any OS I'm aware of) - the problem is that figuring out which folders/files are needed requires a lot of head movement on the disk. It would be very tough for Microsoft to build a general solution to the problem you describe.

stuck
A: 

I gave up on the .NET implementations (for performance reasons) and used the native function GetFileAttributesEx(...).

Try this:

[StructLayout(LayoutKind.Sequential)]
public struct WIN32_FILE_ATTRIBUTE_DATA
{
    public uint fileAttributes;
    public System.Runtime.InteropServices.ComTypes.FILETIME creationTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME lastAccessTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME lastWriteTime;
    public uint fileSizeHigh;
    public uint fileSizeLow;
}

public enum GET_FILEEX_INFO_LEVELS
{
    GetFileExInfoStandard,
    GetFileExMaxInfoLevel
}

public class NativeMethods {
    [DllImport("KERNEL32.dll", CharSet = CharSet.Auto)]
    public static extern bool GetFileAttributesEx(string path, GET_FILEEX_INFO_LEVELS  level, out WIN32_FILE_ATTRIBUTE_DATA data);

}

Now simply do the following:

WIN32_FILE_ATTRIBUTE_DATA data;
if (NativeMethods.GetFileAttributesEx("[your path]", GET_FILEEX_INFO_LEVELS.GetFileExInfoStandard, out data))
{
    // Cast before shifting and combine the halves with | (not &),
    // otherwise the 64-bit size is computed incorrectly.
    long size = ((long)data.fileSizeHigh << 32) | data.fileSizeLow;
}
Adrian Regan
Not working on my machine. File-size-high and file-size-low are always Zero for folders.
AMissico
Have you tried it with GET_FILEEX_INFO_LEVELS.GetFileExMaxInfoLevel? Also, is there no trailing '\' at the end of the path?
Adrian Regan