tags:

views:

625

answers:

9

I have a directory with around 15-30 thousand files. I need to just pull the oldest one. In other words the one that was created first. Is there a quick way to do this using C#, other than loading them into a collection then sorting?

+1  A: 

Sorting is O(n log n). Instead, why don't you just enumerate the directory? I'm not sure what the C# equivalent of FindFirstFile()/FindNextFile() is, but you want to do is:

  • Keep the current lowest date and filename in a local variable.

  • Enumerate the directory.

    • If the date on a given file is less than the local variable, set the local variable to the new date and filename.
asveikau
There really isn't a FindFirst/Next. It's wrapped into Directory.GetFiles. But you're right that all you need is a linear search, which is straight O(n).
Steven Sudit
In that case I might even wonder if you could shave off a few cycles by pinvoking the native enumeration APIs... There's no need to do extra memory allocation here to get them in a big array. You still won't get better than `O(n)` (in that sense the DB suggestion is better), but you can get `O(n)` with a lower constant factor and that might be good enough.
asveikau
Further the native enumeration APIs give you the time and the filename at the same time; it doesn't look like `GetFiles()` does.
asveikau
@asveikau: P/Invoking will not improve performance. You can get full directory information by calling `DirectoryInfo.GetDirectories()`: http://msdn.microsoft.com/en-us/library/s7xk2b58.aspx
280Z28
@280Z28 OK, but what about the memory allocation question? These functions will allocate big blocks of memory that's not really needed with the native interfaces.
asveikau
@asveikau: Memory allocation in the CLR is cheap - MUCH cheaper than calling out to the Windows API for a file system operation.
280Z28
I am as you might guess not very well versed in managed code, but I assume the implementation of `GetDirectories` must take this "calling out to Win32" hit as well. At any rate, I am only offering suggestions of things to try out and measure, not making any guarantee about what will yield better performance.
asveikau
+11  A: 

If you control the directory (that is, if your programs are responsible for creating and maintaining all files in that directory), then you should consider tracking the metadata about each file separately; perhaps in a database.

In fact, the FileStream column type in SQL Server 2008 can help with this. You can create a table that contains columns for filename, create date, modify date, and a FileStream column for the content. You can find things like the oldest file by using indexes on the metadata columns. You can find the content by using the FileStream column.

John Saunders
This isn't wrong, but it may well be overkill.
Steven Sudit
@Steven: Considering the number of files, maybe not. I also wonder what other database-like operations are performed on such directories. It's all moot if he doesn't control the directories involved.
John Saunders
I wonder what the big picture is here.
Steven Sudit
I suspect the right way, without SQL, would be to move files from a drop directory into a branching structure of directories, keeping track of those files in memory.
Steven Sudit
@Steven: we're speculating, of course, but I'd suggest that memory works until the system reboots. A database can survive system or application shutdown.
John Saunders
I wasn't completely clear, but I was thinking of an architecture along the lines of QMail. Upon startup, it recurses through its storage directories and reads them into memory. It then keeps track entirely in memory, only reading from the drop directory when it changes.
Steven Sudit
+7  A: 

The short answer is no. Windows file systems don't index files by date so there is no native way to do this, let alone a .net way without enumerating all of them.

thekaido
A: 

Look, would it not be easier to shell out to a hidden process and redirect the output stream to the input and use the dir /o-d which sorts by the date/time, using the dash reverses the operation....

Edit: here's a sample code to do this...quick and dirty...

public class TestDir
    {
        private StringBuilder sbRedirectedOutput = new StringBuilder();
        public string OutputData
        {
            get { return this.sbRedirectedOutput.ToString(); }
        }
        public void Run()
        {
            System.Diagnostics.ProcessStartInfo ps = new System.Diagnostics.ProcessStartInfo();
            ps.FileName = "cmd";
            ps.ErrorDialog = false;
            ps.Arguments = string.Format("dir {0} /o-d", path_name);
            ps.CreateNoWindow = true;
            ps.UseShellExecute = false;
            ps.RedirectStandardOutput = true;
            ps.WindowStyle = System.Diagnostics.ProcessWindowStyle.Hidden;

            using (System.Diagnostics.Process proc = new System.Diagnostics.Process())
            {
                proc.StartInfo = ps;
                proc.Exited += new EventHandler(proc_Exited);
                proc.OutputDataReceived += new System.Diagnostics.DataReceivedEventHandler(proc_OutputDataReceived);
                proc.Start();
                proc.WaitForExit();
                proc.BeginOutputReadLine();
                while (!proc.HasExited) ;
            }
        }

        void proc_Exited(object sender, EventArgs e)
        {
            System.Diagnostics.Debug.WriteLine("proc_Exited: Process Ended");
        }

        void proc_OutputDataReceived(object sender, System.Diagnostics.DataReceivedEventArgs e)
        {
            if (e.Data != null) this.sbRedirectedOutput.Append(e.Data + Environment.NewLine);
            //System.Diagnostics.Debug.WriteLine("proc_OutputDataReceived: Data: " + e.Data);
        }
    }

The very first 4 or 5 lines of the StringBuilder object sbRedirectedOutput can be chopped out,then after that line would contain the oldest filename and would be quite easy to parse out....

tommieb75
I don't see the benefit of this.
Steven Sudit
@Steven: the directory listing could be redirected to a string, and it's a matter of picking out the filename in the top of the listing...
tommieb75
"Look, would it not be easier..." Ummm... no.
280Z28
`dir` would read in the whole directory to sort it, so there's no benefit over doing it yourself.
Gabe
@gabe: I don't understand what you mean? Instead of having to read the directory into DirectoryInfo, and sorting this, this code I wrote is asynchronous, and reads in the directory which is redirected to a stringbuilder, the dir command sorts it, and is a matter of picking out the right filename from the top and bob's your uncle...why is that not feasible? Either hog up memory by reading in the DirectoryInfo yourself or do a simple shell out to do that task? Something's gotta give, something's gotta take...
tommieb75
When you just want the oldest file, you don't have to sort it. Just read over the directory once and return the lowest date you found, which is O(n). The dir command has no way of knowing you only care about the first result, so it has to perform a sort which is O(n lg n).
Gabe
If I thought that doing dir was a good idea, I would run `cmd /d /c dir /b /o-d "{0}"` and terminate the process as soon as the first result came in.
Gabe
@Gabe: the dir sort command switch does exactly that when shelling out to cmd.exe which is a hidden window...hence you are avoiding the O(n lg n) sort programmatically, as that is done externally via this code I supplied.....you're looking at it the wrong way or that you are not clear in what you're saying, you think that programatically a O(n lg n) is done in the C# code....which this answer shows *how to avoid the sorting programmatically*...
tommieb75
If you look at 280Z28's answer, you will see that it does not have to sort the directory. It looks at each directory entry exactly once, so it is O(n). The dir command will sort the whole directory, so it is O(n lg n). The point isn't to get something else to do the sorting for you, it's to avoid sorting altogether.
Gabe
+2  A: 

Edit: Removed the sort and made it a function.

public static FileInfo GetOldestFile(string directory)
{
    if (!Directory.Exists(directory))
        throw new ArgumentException();

    DirectoryInfo parent = new DirectoryInfo(directory);
    FileInfo[] children = parent.GetFiles();
    if (children.Length == 0)
        return null;

    FileInfo oldest = children[0];
    foreach (var child in children.Skip(1))
    {
        if (child.CreationTime < oldest.CreationTime)
            oldest = child;
    }

    return oldest;
}
280Z28
A: 

Here's a C# routine that may do what you want by spawning a cmd shell execute a dir /o:D on the specified directory and returning the name of the first file found.

        static string GetOldestFile(string dirName)
        {
            ProcessStartInfo si = new ProcessStartInfo("cmd.exe");
            si.RedirectStandardInput = true;
            si.RedirectStandardOutput = true;
            si.UseShellExecute = false;
            Process p = Process.Start(si);
            p.StandardInput.WriteLine(@"dir " + dirName + " /o:D");
            p.StandardInput.WriteLine(@"exit");
            string output = p.StandardOutput.ReadToEnd();
            string[] splitters = { Environment.NewLine };
            string[] lines = output.Split(splitters, StringSplitOptions.RemoveEmptyEntries);
            // find first line with a valid date that does not have a <DIR> in it
            DateTime result;
            int i = 0;
            while (i < lines.Length)
            {
                string[] tokens = lines[i].Split(' ');
                if (DateTime.TryParse(tokens[0], out result))
                {
                    if (!lines[i].Contains("<DIR>"))
                    {
                        return tokens[tokens.Length - 1];
                    }
                }
                i++;
            }

            return "";
        }
I was thinking of going this route, but its about the same as a LINQ approach right?
JL
Though this solution is correct, it is slow and filled with corner cases. Buyer beware.
rpetrich
+3  A: 

You will have to load the FileInfo objects into a collection & sort, but it's a one-liner:

FileSystemInfo fileInfo = new DirectoryInfo(directoryPath).GetFileSystemInfos()
    .OrderByDescending(fi => fi.CreationTime).First();

Ok, two lines because it's a long statement.

Kirk Broadhurst
Great simple answer. It won't get better than this without a change in the directory structure IMO.
Daniel S
+2  A: 

You can't do it without sorting but what you can do is make it fast.

Sorting by CreationTime can be slow because first accessing this property for each file involves interrogation of the file system.

Use A Faster Directory Enumerator that preserves more information about files while enumerating and allows to do sorting faster.

Code to compare performance:

static void Main(string[] args)
{
    var timer = Stopwatch.StartNew();

    var oldestFile = FastDirectoryEnumerator.EnumerateFiles(@"c:\windows\system32")
        .OrderBy(f => f.CreationTime).First();

    timer.Stop();

    Console.WriteLine(oldestFile);
    Console.WriteLine("FastDirectoryEnumerator - {0}ms", timer.ElapsedMilliseconds);
    Console.WriteLine();

    timer.Reset();
    timer.Start();

    var oldestFile2 = new DirectoryInfo(@"c:\windows\system32").GetFiles()
        .OrderBy(f => f.CreationTime).First();

    timer.Stop();

    Console.WriteLine(oldestFile2);
    Console.WriteLine("DirectoryInfo - {0}ms", timer.ElapsedMilliseconds);

    Console.WriteLine("Press ENTER to finish");
    Console.ReadLine();
}

For me it gives this:

VEN2232.OLB

FastDirectoryEnumerator - 27ms

VEN2232.OLB

DirectoryInfo - 559ms

Konstantin Spirin
All I can say is WoW - ShamWoW! Thanks a lot for this, I will certainly implement this.
JL
@Konstantin: In .NET 4.0, the `DirectoryInfo.GetFiles()` method now preserves the additional info, so it doesn't suffer from the performance problem you've addressed.
280Z28
A: 

Oddly enough, this worked perfectly on a directory of mine with 3000+ jpg files:

DirectoryInfo di = new DirectoryInfo(dpath);

FileInfo[] rgFiles = di.GetFiles("*.jpg");

FileInfo firstfile = rgFiles[0];

FileInfo lastfile = rgFiles[rgFiles.Length - 1];

DateTime oldestfiletime = firstfile.CreationTime;

DateTime newestfiletime = lastfile.CreationTime;

cbabb