Situation:

I have a C# program which does the following:

  • Generate many files (replacing the ones generated the last time the program ran).
  • Read those files and perform a time-consuming computation.

Problem:

I only want to perform the time-consuming computation on files which have actually changed since the last time I ran the program.

Solution 1:

  • Rename the old file.
  • Write the new file.
  • Read and compare both files.

This involves writing one file and reading two, which seems like more disk access than necessary.

Solution 2:

  • Write to a string instead of a file.
  • Read the old file and compare to the string.
  • If they are different, overwrite the old file.

This would involve reading one file and possibly writing one, which seems like a big improvement over my first idea.
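Solution 2 can be sketched in a few lines of C#. This is a minimal illustration, not a full program: the `WriteIfChanged` helper and the file path are hypothetical names, and the generated content is stubbed out with a literal string.

```csharp
using System;
using System.IO;

class ConditionalWriter
{
    // Writes newContent to path only if it differs from the file's current
    // contents. Returns true if the file was (re)written, i.e. if the
    // time-consuming computation needs to run for this file.
    public static bool WriteIfChanged(string path, string newContent)
    {
        if (File.Exists(path) && File.ReadAllText(path) == newContent)
            return false; // unchanged: skip the write and the later computation

        File.WriteAllText(path, newContent);
        return true;
    }

    static void Main()
    {
        // Hypothetical example: generate the content into a string first.
        string content = "generated data";
        bool changed = WriteIfChanged("output.txt", content);
        Console.WriteLine(changed ? "changed" : "unchanged");
    }
}
```

Note this reads each old file in full before deciding; the checksum ideas below avoid that.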

Question:

Can you describe a better way to solve my problem? (and explain why it is better?)

+4  A: 

One solution could be to generate some sort of checksum from the contents of the file. Then when you generate the new contents, you only need to compare the checksum values to see whether the file has changed.

Store the checksum as the first record in the file (or at least fairly near the start of the file) to minimise the amount of data you have to read.

If you could somehow store the checksum as an attribute of the file (rather than in the file itself) you wouldn't even need to open the old file. Another alternative would be to store the checksum and the file it referred to in another central file or database, but there is the danger that it could get out of step.
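The checksum idea could be sketched like this, using the separate-side-file variant (the file names and the `Md5Hex` helper are hypothetical; any stable hash would do in place of MD5):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

class ChecksumCompare
{
    // Hex-encoded MD5 of a string's UTF-8 bytes.
    public static string Md5Hex(string text)
    {
        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(text));
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }

    static void Main()
    {
        // Hypothetical side file holding the previous run's checksum.
        string checksumPath = "output.txt.md5";
        string newContent = "generated data";
        string newChecksum = Md5Hex(newContent);

        string oldChecksum = File.Exists(checksumPath)
            ? File.ReadAllText(checksumPath)
            : null;

        if (newChecksum != oldChecksum)
        {
            // Contents changed: overwrite file, record new checksum,
            // and run the time-consuming computation for this file.
            File.WriteAllText("output.txt", newContent);
            File.WriteAllText(checksumPath, newChecksum);
        }
    }
}
```

With this approach the old file never needs to be read at all; only the small checksum file is consulted.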

ChrisF
Thanks, this is a valid solution to my problem. Would it be even better to keep all of the checksums in one file so that they can be read sequentially?
Paul Williams
An MD5 hash, say, could go right in the filename.
Frank Schmitt
@Paul - My preference would be to keep them attached to the file as then there's less chance of them getting out of step, but there's nothing wrong with keeping them in a separate file.
ChrisF
@ChrisF - I see your point. There would be more ways to go wrong.
Paul Williams
A: 

At the end of each run, save the execution time to a file.

During the next run, after you've created all the new files, use DirectoryInfo to iterate through the files in the directory and check each file's GetLastWriteTime (http://msdn.microsoft.com/en-us/library/system.io.file.getlastwritetime.aspx) against the stored execution time. If the LastWriteTime is after the saved time, that file was modified by the current execution, so you have to process it.
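A rough sketch of this answer, assuming a hypothetical "lastrun.txt" timestamp file and an "output" directory (as the comment below notes, this flags every rewritten file, even ones whose contents are unchanged):

```csharp
using System;
using System.IO;

class ModifiedSince
{
    static void Main()
    {
        // Load the previous run's end time, if one was saved.
        DateTime lastRun = File.Exists("lastrun.txt")
            ? DateTime.Parse(File.ReadAllText("lastrun.txt"))
            : DateTime.MinValue;

        var dir = new DirectoryInfo("output");
        foreach (FileInfo file in dir.GetFiles())
        {
            if (file.LastWriteTime > lastRun)
            {
                // Written since the last run: process it.
                Console.WriteLine("process " + file.Name);
            }
        }

        // Save this run's end time for the next execution ("o" = round-trip format).
        File.WriteAllText("lastrun.txt", DateTime.Now.ToString("o"));
    }
}
```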

Jim Flynn
Unfortunately this won't work in my case. When writing to a file, the "write time" is always updated, regardless of whether or not the end-result is an identical file.
Paul Williams