views: 169

answers: 2

Hello everyone,

I am using VSTS 2008 + C# + .Net 3.5 to develop a console application. I need to enumerate the 50 most recent files in the current folder (to read file content and to get file metadata such as file name, creation time, etc.). The current folder holds about 5,000 files, and if I use the Directory.GetFiles API, metadata for all 5,000 files will be read into memory. That seems wasteful, since I only need to access the 50 most recent files.

Are there any solutions to access only the 50 most recent files in the current directory?

thanks in advance, George

+3  A: 

This solution still loads metadata about all files, but I would say it's fast enough for most uses. The following code reports that it takes around 50ms to enumerate the 50 most recently updated files in my Windows\System32 directory (~2500 files). Unless the code is run very frequently I would probably not spend time optimizing it a lot more:

// Requires: using System; using System.Collections.Generic;
//           using System.Diagnostics; using System.IO; using System.Linq;
FileInfo[] files = (new DirectoryInfo(@"C:\WINDOWS\System32")).GetFiles();
Stopwatch sw = new Stopwatch();
sw.Start();
// Sort all FileInfo objects by last write time (newest first) and keep the top 50
IEnumerable<FileInfo> recentFiles = files.OrderByDescending(
                                              fi => fi.LastWriteTime).Take(50);
List<FileInfo> list = recentFiles.ToList();   // force evaluation so it is included in the timing
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
list.ForEach(fi => Console.WriteLine(fi.Name));

Update

Based on the discussion in the comments regarding using date/time in the file name: note that Directory.GetFiles does not load metadata about files; it simply returns a string array with file names (DirectoryInfo.GetFiles, on the other hand, returns an array of FileInfo objects). So, if you have the date and time in your file names (preferably in a format that lends itself to sorting, such as yyyyMMdd-HHmmss or something like that), you can use Directory.GetFiles to get the file names, sort descending and then pick the first 50 from the list:

// Requires: using System.Collections.Generic; using System.IO; using System.Linq;
string[] files = Directory.GetFiles(pathToLogFiles);   // file names only, no metadata
IEnumerable<string> recentFiles = files.OrderByDescending(s => s).Take(50);
List<string> recentFileList = recentFiles.ToList();

Then loop over the list and load whatever data you need from each file.
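
For illustration, that loop might look something like this minimal sketch (what you do with each file is an assumption on my part; recentFileList is the list built above):

foreach (string path in recentFileList)
{
    FileInfo fi = new FileInfo(path);          // metadata for this one file only
    Console.WriteLine("{0} created {1}", fi.Name, fi.CreationTime);
    string content = File.ReadAllText(path);   // read the log content
    // ... extract whatever the daily report needs from content ...
}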

Fredrik Mörk
Thanks! The problem for me is that the directory holds audit log files, and my application produces about 50 audit files per day (which is why I need to access the 50 most recent files each day to generate a daily audit report). If the application runs for a year, the number of audit files will become large. Any comments? If there is no solution at the .Net File/Directory API level, maybe it is a design issue on my side?
George2
Is it a possibility to make the date/time part of the audit log file names? If so, you can simply sort descending based on file name and take the first 50 files.
Fredrik Mörk
George2: Have a look at the FileSystemWatcher class, it might help you: http://msdn.microsoft.com/en-us/library/system.io.filesystemwatcher.aspx
DrJokepu
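
For reference, a minimal FileSystemWatcher sketch (the path is hypothetical; this is just to show the shape of the API). The handler could maintain its own list of the newest files so that no directory scan is needed later:

FileSystemWatcher watcher = new FileSystemWatcher(@"C:\AuditLogs");   // hypothetical folder
watcher.Created += (sender, e) => Console.WriteLine("New file: " + e.FullPath);
watcher.EnableRaisingEvents = true;   // start raising Created events
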
Hi Fredrik, even if I name the files using the date/time, how can I retrieve meta information about just the 50 most recent files? If I name the files by datetime but still use Directory.GetFiles, there is no benefit -- I still need to read all file metadata into memory. Any comments? Correct me if I have misunderstood your point.
George2
Hi DrJokepu, I studied this class and it looks neat, but I cannot see how to use it to solve my issue. I would appreciate it if you could share more insight. Thanks.
George2
George2: If you have control over where the log files are stored, you could store them in a different folder each day. Then you would only have to enumerate the files in the folder for the current day and, if there are fewer than 50, the files in the folder for the day before, so at most 100 files.
dtb
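
A rough sketch of that per-day folder idea (the root path and the yyyy-MM-dd folder naming are assumptions; requires System, System.Collections.Generic, System.IO and System.Linq):

string root = @"C:\AuditLogs";   // hypothetical root folder
string today = Path.Combine(root, DateTime.Today.ToString("yyyy-MM-dd"));
string yesterday = Path.Combine(root, DateTime.Today.AddDays(-1).ToString("yyyy-MM-dd"));

List<string> candidates = new List<string>();
if (Directory.Exists(today))
    candidates.AddRange(Directory.GetFiles(today));
if (candidates.Count < 50 && Directory.Exists(yesterday))
    candidates.AddRange(Directory.GetFiles(yesterday));   // top up from the previous day

List<string> recent = candidates.OrderByDescending(s => s).Take(50).ToList();
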
@George2: I updated the answer with some discussion on the date/time in the file name.
Fredrik Mörk
Good idea, thanks dtb! So you mean there is no way, using combinations of the built-in .Net APIs, to enumerate the metadata for only the 50 most recent files?
George2
Thanks Fredrik, I want to confirm your idea: using Directory (strings) rather than DirectoryInfo (FileInfo objects) uses less memory and is more efficient, but it only gives access to the file names; so if we name the files according to a rule, we only need the lightweight Directory.GetFiles API instead of DirectoryInfo.GetFiles?
George2
@George2: yes, that was the idea.
Fredrik Mörk
Such a smart idea! I have marked your reply as the answer.
George2
+1  A: 

I'm really not sure it will be worth your while... consider the following program:

 // Requires: using System; using System.Collections.Generic; using System.IO;
 class DateCompare : IComparer<FileInfo>
 {
  public int Compare(FileInfo a, FileInfo b)
  {
   // Sort by last write time; fall back to the full name so two distinct
   // files never compare as equal.
   int result = a.LastWriteTime.CompareTo(b.LastWriteTime);
   if (result == 0)
    return StringComparer.OrdinalIgnoreCase.Compare(a.FullName, b.FullName);
   return result;
  }
 }

 class Program
 {
  public static void Main(string[] args)
  {
   DirectoryInfo root = new DirectoryInfo("c:\\Projects\\");
   DateTime start = DateTime.Now;
   long memory = GC.GetTotalMemory(false);
   FileInfo[] allfiles = root.GetFiles("*", SearchOption.AllDirectories);
   DateTime sortStart = DateTime.Now;
   List<FileInfo> files = new List<FileInfo>(20000);
   IComparer<FileInfo> cmp = new DateCompare();
   foreach (FileInfo file in allfiles)
   {
    // Binary insertion sort: BinarySearch returns the bitwise complement
    // of the insertion index when the item is not already in the list.
    int pos = ~files.BinarySearch(file, cmp);
    files.Insert(pos, file);
   }
   Console.WriteLine("Count = {0:#,###}, Read = {1}, Sort = {2}, Memory = {3:#,###}",
    files.Count, sortStart - start, DateTime.Now - sortStart,
    GC.GetTotalMemory(false) - memory);
  }
 }

This is the output of the above program:

Count = 16,357, Read = 00:00:03.5793579, Sort = 00:00:06.7776777, Memory = 5,758,976
Count = 16,357, Read = 00:00:03.2173217, Sort = 00:00:06.1616161, Memory = 7,339,920
Count = 16,357, Read = 00:00:03.5083508, Sort = 00:00:06.7556755, Memory = 10,346,504

That runs in about 3 seconds, allocating between 5 and 10 MB, while crawling 6,931 directories and returning 16k file names. That is three times the volume you're talking about, and I bet most of the time is spent crawling the directory tree (I don't have a single directory with 5k files in it). The worst expense is always going to be the sort; if you can throw out files early by matching on file names, I would recommend that.
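
For example, a sketch of throwing out files by name before touching any metadata (the pathToLogFiles variable and the sortable yyyyMMdd naming are assumptions carried over from the other answer):

string todayPrefix = DateTime.Today.ToString("yyyyMMdd");
IEnumerable<string> todaysFiles = Directory.GetFiles(pathToLogFiles)   // names only, cheap
    .Where(p => Path.GetFileName(p).StartsWith(todayPrefix))
    .OrderByDescending(p => p)
    .Take(50);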

csharptest.net
Thanks! I want to confirm: your experiment is meant to show that even getting all files' metadata and sorting it in memory is not a big deal?
George2
Depending on the sort chosen it should not be a big deal. For large sorts of unique data I tend to use the binary insertion sort above. I know there are better ways, but it performs well enough and is much faster than List<T>.Sort().
csharptest.net
Why is using BinarySearch + Insert faster than inserting all items unsorted and then calling List.Sort()?
George2
Without doing a deep dive into the implementation of List.Sort, I'd have to say it has to do with the cycles wasted moving each array item. Believe it or not, the above code is also faster than using a SortedList<T>. I've tried each with roughly the same code; try it for yourself. You should see about a 50% increase in run time switching to SortedList<T>, and almost a 100% increase switching to List.Sort().
csharptest.net
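
For reference, the List<T>.Sort alternative being compared against (using the same DateCompare comparer and the allfiles array from the program above) would simply be:

List<FileInfo> files = new List<FileInfo>(allfiles);   // copy everything, then sort once
files.Sort(new DateCompare());
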
Thanks, good idea!
George2