views: 413 · answers: 7

I have a huge directory of about 500k jpg files, and I'd like to archive all files that are older than a certain date. Currently, the script takes hours to run.

This has a lot to do with the piss-poor performance of GoGrid's storage servers, but at the same time, I'm sure there's a far more efficient way, RAM/CPU-wise, to accomplish what I'm doing.

Here's the code I have:

var dirInfo = new DirectoryInfo(PathToSource);
var fileInfo = dirInfo.GetFiles("*.*");
var filesToArchive = fileInfo.Where(f => 
    f.LastWriteTime.Date < StartThresholdInDays.Days().Ago().Date
      && f.LastWriteTime.Date >= StopThresholdInDays.Days().Ago().Date
);

foreach (var file in filesToArchive)
{
    file.CopyTo(PathToTarget+file.Name);
}

The Days().Ago() stuff is just syntactic sugar.

+2  A: 

I'd keep in mind the 80/20 rule and note that if the bulk of the slowdown is file.CopyTo, and this slowdown far outweighs the performance of the LINQ query, then I wouldn't worry. You can test this by removing the file.CopyTo line and replacing it with a Console.WriteLine operation. Time that versus the real copy. That will show you how much of the time is GoGrid overhead versus the rest of the operation. My hunch is there won't be any realistically big gains on your end.
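
For example, a rough timing sketch (purely illustrative, reusing the variables and the Days().Ago() helper from the question) to see where the time actually goes:

// Rough timing sketch: measure the directory scan and the copy loop
// separately, so you know which one actually dominates.
var sw = System.Diagnostics.Stopwatch.StartNew();

var dirInfo = new DirectoryInfo(PathToSource);
var fileInfo = dirInfo.GetFiles("*.*");
Console.WriteLine("GetFiles:  {0}", sw.Elapsed);

var filesToArchive = fileInfo.Where(f =>
    f.LastWriteTime.Date < StartThresholdInDays.Days().Ago().Date
      && f.LastWriteTime.Date >= StopThresholdInDays.Days().Ago().Date);

sw = System.Diagnostics.Stopwatch.StartNew();
foreach (var file in filesToArchive)
{
    // Swap CopyTo for Console.WriteLine(file.Name) to time everything
    // except the copy itself.
    file.CopyTo(PathToTarget + file.Name);
}
Console.WriteLine("Copy loop: {0}", sw.Elapsed);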

EDIT: OK, so the 80% is the GetFiles operation, which isn't surprising if in fact there are a million files in the directory. Your best bet may be to use the Win32 API directly (FindFirstFile and family) via P/Invoke:

[DllImport("kernel32.dll", CharSet=CharSet.Auto)]
static extern IntPtr FindFirstFile(string lpFileName, 
    out WIN32_FIND_DATA lpFindFileData);
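
For reference, a sketch of the companion declarations a full FindFirstFile loop would need (the commonly used P/Invoke signatures; double-check them against the Win32 documentation before relying on them, and note they require System.Runtime.InteropServices):

[StructLayout(LayoutKind.Sequential, CharSet = CharSet.Auto)]
struct WIN32_FIND_DATA
{
    public FileAttributes dwFileAttributes;
    public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
    public uint nFileSizeHigh;
    public uint nFileSizeLow;
    public uint dwReserved0;
    public uint dwReserved1;
    [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
    public string cFileName;
    [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
    public string cAlternateFileName;
}

[DllImport("kernel32.dll", CharSet = CharSet.Auto)]
static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);

[DllImport("kernel32.dll")]
static extern bool FindClose(IntPtr hFindFile);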

I'd also suggest, if possible, altering the directory structure to decrease the number of files per directory. This will improve the situation immensely.

EDIT2: I'd also consider changing from GetFiles("*.*") to just GetFiles(). Since you're asking for everything, no sense in having it apply globbing rules at each step.

sixlettervariables
The bulk of the operation is the dirInfo.GetFiles("*.*") statement. I'm doing a test with only 5 days' worth of files, and I run out of RAM/patience before I can even get a count of the files in the directory from which to do the LINQ query. Is there a better way to GetFiles(), like having GetFiles() return only the files within a range, instead of having to return them all? At least that way, I could break this operation into chunks of 10% this first time, and then have the archiver run every night. As it stands now, I can't really get anywhere.
Scott
Yes, altering the directory structure is what I'm trying to do, but first I need to access files without waiting all day and timing out the server :)
Scott
+10  A: 

The only part that I think you could improve is the dirInfo.GetFiles("*.*"). In .NET 3.5 and earlier, it returns an array with all the file names, which takes time to build and uses lots of RAM. In .NET 4.0, there is a new Directory.EnumerateFiles method that returns an IEnumerable<string> instead, and fetches results immediately as they are read from the disk. This could improve performance a bit, but don't expect miracles...
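
For example, a sketch of what that might look like (DirectoryInfo.EnumerateFiles is the FileInfo-returning counterpart added in .NET 4.0, and Days().Ago() is the question's own helper):

// .NET 4.0: EnumerateFiles streams results lazily instead of building a 500k-element array first.
var dirInfo = new DirectoryInfo(PathToSource);
var filesToArchive = dirInfo.EnumerateFiles()
    .Where(f => f.LastWriteTime.Date < StartThresholdInDays.Days().Ago().Date
             && f.LastWriteTime.Date >= StopThresholdInDays.Days().Ago().Date);

foreach (var file in filesToArchive)
{
    file.CopyTo(Path.Combine(PathToTarget, file.Name));
}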

Thomas Levesque
Actually, that is exactly what needs to be done: EnumerateFiles returns an enumerator, not the whole list, so you save all the memory needed for the array. Let's say it's 500k files * 100 bytes = 50 MB of RAM. Using EnumerateFiles you only hold about 100 bytes at a time, because you get one file at a time.
Kugel
+1, .Net 4.0 has lots of really nice features in System.IO. Not sure if it will improve the situation with a million files in a directory :-D
sixlettervariables
+2  A: 

You should consider using a third party utility to perform the copying for you. Something like robocopy may speed up your processing significantly. See also http://serverfault.com/questions/54881/quickest-way-of-moving-a-large-number-of-files

Manu
+1, robocopy /MINAGE:X /MAXAGE:Y
sixlettervariables
And robocopy is included in Win7 and Server 2008 by default!
joshperry
yes, not exactly what I'd call "third party" ;)
Thomas Levesque
A: 

Take a listen to this Hanselminutes podcast. Scott talks to Aaron Bockover, the author of the Banshee media player; they ran into this exact issue and discuss it at 8:20 in the podcast.

If you can use .NET 4.0, then use Directory.EnumerateFiles, as mentioned by Thomas Levesque. If not, then you may need to write your own directory-walking code against the native Win32 APIs, like they did in Mono.Posix.

joshperry
+2  A: 

While .NET 4.0 provides the lazy Directory.EnumerateFiles, you can do this right now on .NET 3.5:
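
A minimal sketch of one way to do that, assuming the FindFirstFile/FindNextFile/FindClose declarations and WIN32_FIND_DATA struct shown in the first answer (EnumerateFilesLazy is an illustrative name, not a framework method, and the usual System.IO, System.Collections.Generic and System.Runtime.InteropServices usings are assumed):

static readonly IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);

// Streams file paths one at a time instead of materializing a 500k-element array.
static IEnumerable<string> EnumerateFilesLazy(string directory)
{
    WIN32_FIND_DATA findData;
    IntPtr handle = FindFirstFile(Path.Combine(directory, "*"), out findData);
    if (handle == INVALID_HANDLE_VALUE)
        yield break;

    try
    {
        do
        {
            // Skip subdirectories (including "." and ".."); yield files only.
            if ((findData.dwFileAttributes & FileAttributes.Directory) == 0)
                yield return Path.Combine(directory, findData.cFileName);
        }
        while (FindNextFile(handle, out findData));
    }
    finally
    {
        FindClose(handle);
    }
}

The caller can then apply the same date window and copy as before; findData.ftLastWriteTime is also available in the struct if you want to avoid a second metadata lookup per file:

foreach (var path in EnumerateFilesLazy(PathToSource))
{
    var lastWrite = File.GetLastWriteTime(path).Date;
    if (lastWrite < StartThresholdInDays.Days().Ago().Date
        && lastWrite >= StopThresholdInDays.Days().Ago().Date)
    {
        File.Copy(path, Path.Combine(PathToTarget, Path.GetFileName(path)));
    }
}
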

Mauricio Scheffer
Thanks Mauricio...this works for the RAM problem, but not CPU. It still takes hours to accomplish but at least the RAM doesn't balloon out on me.
Scott
That works well enough to solve my problem. Takes about 2 hours, but now it can run in the background w/ a maximum of 4 megs of RAM, whereas before, it would use hundreds of megs.
Scott
+1  A: 

Reading your question: you want to archive files, so I would expect you want to move them rather than copy them?

Possibly change

foreach (var file in filesToArchive)
{
  file.CopyTo(PathToTarget+file.Name);
}

to

foreach (var file in filesToArchive)
{
  file.MoveTo(PathToTarget+file.Name);
}

... because a move on the same volume just updates the file system's metadata (like changing a pointer in the FAT), whereas a copy has to literally duplicate all the bytes of the file. (Note that a move across volumes still copies the data and then deletes the source, so this only helps when the source and target are on the same volume.)

I'd expect the move would be much quicker...

Neil Fenwick
Yeah, Neil, that's good advice. Keep in mind, however, that the major bottleneck comes BEFORE the file copy. Just scanning the folder to find all files older than 30 days takes hours. Even with the copy switched off entirely, the script is a no-go.
Scott
@Scott, fair dos, you're right. The OS has to aggregate 500k+ records, and then you start looping through them again in your code. Added another answer (a completely different approach, so I didn't edit this one).
Neil Fenwick
A: 

After learning that it's the iteration over all those files that's taking so long, I'm thinking you might have to change strategy.

Maybe you're biting off more than you can chew in one go.

Possibly create a service that triggers several times a day and just does a few files at a time... Not sure what your file names look like, but you'll get the idea from the pattern below.

// Get all files whose names start with "a"

var dirInfo = new DirectoryInfo(PathToSource);
var fileInfo = dirInfo.GetFiles("a*");
var filesToArchive = fileInfo.Where(f =>
    f.LastWriteTime.Date < StartThresholdInDays.Days().Ago().Date
      && f.LastWriteTime.Date >= StopThresholdInDays.Days().Ago().Date
);

foreach (var file in filesToArchive)
{
    file.MoveTo(PathToTarget + file.Name);
}

// Rinse and repeat for each letter of the alphabet

The code above should get progressively faster as you move more and more files out of the folder.

Neil Fenwick