I have two hard-disk volumes (one is a backup image of the other). I want to compare the volumes and list all the modified files, so that the user can select the ones he/she wants to roll back.

Currently I'm recursing through the new volume and comparing each file's timestamps to the old volume's files (if they exist in the old volume). Obviously this is a crude approach: it's time-consuming, and since timestamps can be tampered with, it's wrong!

Is there an efficient way to do it?

EDIT:
- I'm using FindFirstFile and the like to recurse the volume and gather info about each file (not very slow, just a few minutes).
- I'm using Volume Shadow Copy to take the backup.
- The backup volume is remote, so I cannot continuously monitor the actual volume.

A: 

Assuming you're not comparing each file on the new volume to every file in the snapshot, that's the only way you can do it. How are you going to find which files aren't modified without looking at all of them?

Billy ONeal
This is the approach I've switched to (I take the diff of each file in the new volume with the old snapshot), but this is very slow. I was thinking maybe at a lower level (scanning and comparing the blocks or something)?
lalli
@lalli: There's no lower-level (supported) API than `FindFirstFile` and friends. Even if you were going to do it by parsing the NTFS on-disk format yourself, I doubt you could do it faster than Windows' own ntfs.sys.
Billy ONeal
A: 

I am not a Windows programmer. However, shouldn't you have a stat function to retrieve the modified time of a file? Sort the files by modification time. The files with a modification time greater than your last backup time are the ones of interest.

For the first run, you can iterate over the backup volume to figure out the maximum modification and creation times in your set of interest. I am assuming the directories of interest don't get modified in the backup volume.
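
A sketch of the single pass this describes (my illustration, not part of the original answer; as the comments below note, last-write times can be forged, so this is a heuristic only):

#include <windows.h>
#include <stdio.h>
#include <string>

// Recursively print files whose last-write time is newer than lastBackup.
// FindFirstFileW already fills in the timestamps, so no per-file handle
// (and hence no GetFileTime call) is needed.
void FindModified(const std::wstring &dir, const FILETIME &lastBackup)
{
    WIN32_FIND_DATAW fd;
    HANDLE h = FindFirstFileW((dir + L"\\*").c_str(), &fd);
    if (h == INVALID_HANDLE_VALUE)
        return;
    do {
        if (wcscmp(fd.cFileName, L".") == 0 || wcscmp(fd.cFileName, L"..") == 0)
            continue;
        std::wstring path = dir + L"\\" + fd.cFileName;
        if (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
            FindModified(path, lastBackup);                  // recurse into subdirectory
        else if (CompareFileTime(&fd.ftLastWriteTime, &lastBackup) > 0)
            wprintf(L"modified: %s\n", path.c_str());        // newer than the backup
    } while (FindNextFileW(h, &fd));
    FindClose(h);
}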

aeh
`stat` is not a Windows function.
Billy ONeal
Shouldn't GetFileTime do the job? Also, you don't even need to sort if you know the max mod time. You can get the list in a single pass over the current volume.
aeh
@user433874: No, `GetFileTime` requires a file handle, which means you'd have to open every file in question. `FindFirstFile` and friends **already return the time** when you enumerate a directory, so I do not really see what your point is here.
Billy ONeal
@Billy: If APIs like FindFirstFile are available, I don't understand why it's so time-consuming to just store the last backup time and check whether each file's last-modified time is greater than the last backup time; if true, the file is modified. Am I missing something here? Why do you have to compare each file with the backup volume's copy?
aeh
@user433874: That's a really good optimization you're suggesting. But the point is, the timestamps can be easily modified (that's why I referred to this method in my question as wrong), so the results may not be correct.
lalli
+2  A: 

Instead of waiting until after changes have happened, and then scanning the whole disk to find the (usually few) files that have changed, I'd set up a program to use ReadDirectoryChangesW to monitor changes as they happen. This will let you build a list of files with a minimum of fuss and bother.
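
For illustration, a minimal synchronous watcher (my sketch, not Jerry's code; the watched path C:\watched and the notify filter are assumptions, and a real service would use overlapped I/O instead of blocking):

#include <windows.h>
#include <stdio.h>

int main()
{
    // Open the directory itself; FILE_FLAG_BACKUP_SEMANTICS is required
    // to obtain a directory handle.
    HANDLE hDir = CreateFileW(L"C:\\watched", FILE_LIST_DIRECTORY,
                              FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                              NULL, OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, NULL);
    if (hDir == INVALID_HANDLE_VALUE)
        return 1;

    DWORD buffer[16 * 1024];   // DWORD-aligned, as ReadDirectoryChangesW requires
    DWORD bytesReturned;
    while (ReadDirectoryChangesW(hDir, buffer, sizeof(buffer),
                                 TRUE,   // watch the whole subtree
                                 FILE_NOTIFY_CHANGE_FILE_NAME |
                                 FILE_NOTIFY_CHANGE_LAST_WRITE |
                                 FILE_NOTIFY_CHANGE_SIZE,
                                 &bytesReturned, NULL, NULL)) {
        FILE_NOTIFY_INFORMATION *info = (FILE_NOTIFY_INFORMATION *)buffer;
        for (;;) {
            // FileName is not null-terminated; FileNameLength is in bytes.
            wprintf(L"change %lu: %.*s\n", info->Action,
                    (int)(info->FileNameLength / sizeof(WCHAR)), info->FileName);
            if (info->NextEntryOffset == 0)
                break;
            info = (FILE_NOTIFY_INFORMATION *)((BYTE *)info + info->NextEntryOffset);
        }
    }
    CloseHandle(hDir);
    return 0;
}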

Jerry Coffin
Note that this isn't going to work across reboots or other such fun. Also, it might lead to funny results when Volume Shadow Copy is in play. If a solution like this is acceptable, you should consider using the USN journal ( http://msdn.microsoft.com/en-us/library/aa363798.aspx ) instead.
Billy ONeal
@Billy: yes, I was figuring that if you were going to do this routinely, you'd probably implement it as a service to be started automatically. I hesitate to recommend the USN journal, simply because I've never used it, but from what I recall of the documentation it's probably a good choice.
Jerry Coffin
@Jerry: The issue is that even if you register to autostart, you cannot start early enough to catch changes made by, say, the kernel. Any solution based on monitoring will *not* be 100% reliable across reboots under *any* circumstances.
Billy ONeal
@Billy: True, but absent a reason to believe otherwise I'd guess he doesn't care -- many of those same changes aren't amenable to the methods he's using now either (e.g., your access to the registry files is *quite* limited).
Jerry Coffin
The backup volume is remote, so I cannot continuously monitor the actual volume.
lalli
@lalli: In that case Billy's right: USN journals are almost certainly the right answer.
Jerry Coffin
A: 

Without knowing more details about what you're trying to do here, it's hard to say. However, some tips about what I think you're trying to achieve:

  • If you're only concerned with NTFS volumes, I suggest looking into the USN / change journal APIs, which have been around since Windows 2000. This way, after the initial inventory you only have to look at changes from that point on (a sketch follows this list). A good starting point, though a very old article, is here: http://www.microsoft.com/msj/0999/journal/journal.aspx
  • The first time through comparing a drive's contents, utilize a hash such as SHA-1 or MD5.
  • Store hashes and other such information in a database of some sort, for example SQLite3. Note that this can take up a huge amount of space itself: a quick look at my audio folder with 40k+ files would result in ~750 megs of MD5 information.
  • Utilizing the USN APIs, you could also omit the hash step and just record information from the journal yourself (this will become clearer when/if you look into said APIs).
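
A minimal sketch of reading the change journal (my addition, not NuSkooler's code; it assumes a journal already exists on C:, that the process runs elevated, and uses the classic struct names, which newer SDKs suffix with _V0):

#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

int main()
{
    // Open the live volume itself (requires administrator rights).
    HANDLE hVol = CreateFileW(L"\\\\.\\C:", GENERIC_READ,
                              FILE_SHARE_READ | FILE_SHARE_WRITE,
                              NULL, OPEN_EXISTING, 0, NULL);
    if (hVol == INVALID_HANDLE_VALUE)
        return 1;

    // Ask NTFS where the journal currently stands.
    USN_JOURNAL_DATA journal;
    DWORD bytes;
    if (!DeviceIoControl(hVol, FSCTL_QUERY_USN_JOURNAL, NULL, 0,
                         &journal, sizeof(journal), &bytes, NULL))
        return 1;   // no journal yet; FSCTL_CREATE_USN_JOURNAL would create one

    READ_USN_JOURNAL_DATA readData = {0};   // READ_USN_JOURNAL_DATA_V0 in newer SDKs
    readData.StartUsn = journal.FirstUsn;   // in practice: the USN saved at backup time
    readData.ReasonMask = 0xFFFFFFFF;       // all change reasons
    readData.UsnJournalID = journal.UsnJournalID;

    ULONGLONG buffer[8192];                 // 64 KB, 8-byte aligned for USN records
    while (DeviceIoControl(hVol, FSCTL_READ_USN_JOURNAL, &readData,
                           sizeof(readData), buffer, sizeof(buffer),
                           &bytes, NULL) && bytes > sizeof(USN)) {
        // The output begins with the next USN to read from, then the records.
        BYTE *p = (BYTE *)buffer + sizeof(USN);
        while (p < (BYTE *)buffer + bytes) {
            USN_RECORD *rec = (USN_RECORD *)p;
            wprintf(L"USN %I64d reason %08x: %.*s\n",
                    rec->Usn, rec->Reason,
                    (int)(rec->FileNameLength / sizeof(WCHAR)),
                    (WCHAR *)((BYTE *)rec + rec->FileNameOffset));
            p += rec->RecordLength;
        }
        readData.StartUsn = *(USN *)buffer;  // resume where this batch ended
    }
    CloseHandle(hVol);
    return 0;
}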
NuSkooler
An MD5 hash of every file on the volume is probably going to be quite large -- plan on having several GB of space just for your index.
Billy ONeal
+1  A: 

Part of this depends upon how the two volumes are duplicated; if they are 'true' copies from the file system's point of view (e.g. shadow copies or other block-level copies), you can do a few tricky little things with respect to USN, which is the general technology others are suggesting you look into. You might want to look at an API like FSCTL_READ_FILE_USN_DATA, for example. That API will let you compare two different copies of a file (again, assuming they are the same file with the same file reference number from block-level backups). If you wanted to be largely stateless, this and similar APIs would help you a lot here. My algorithm would look something like this:

foreach( file in backup_volume ) {
    file_still_exists = try_open_by_id( modified_volume )
    if (file_still_exists) {
        usn_result = compare_usn_values_of_files( file, file_in_modified_volume )
        if (usn_result == equal_to) {
           // file hasn't changed at all
        } else {
           // file has changed (somehow)
        }
    } else {
        // file was deleted (possibly deleted and recreated)
    }
}
// we still don't know about files new in modified_volume
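
The two primitives the pseudocode assumes could be sketched like this (my illustration; the names TryOpenById and GetFileUsn are hypothetical, and OpenFileById requires Vista or later):

#include <windows.h>
#include <winioctl.h>

// try_open_by_id: open a file on the live volume by the file reference
// number recorded for its counterpart on the backup volume. hVolume is
// any open handle on the target volume.
HANDLE TryOpenById(HANDLE hVolume, LONGLONG fileRefNumber)
{
    FILE_ID_DESCRIPTOR id = { sizeof(id) };
    id.Type = FileIdType;
    id.FileId.QuadPart = fileRefNumber;
    return OpenFileById(hVolume, &id, FILE_READ_ATTRIBUTES,
                        FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                        NULL, 0);
}

// compare_usn_values_of_files boils down to fetching each file's current
// USN record and comparing the Usn fields. Returns -1 on failure.
LONGLONG GetFileUsn(HANDLE hFile)
{
    ULONGLONG buffer[128];   // room for a USN_RECORD plus a long file name
    DWORD bytes;
    if (!DeviceIoControl(hFile, FSCTL_READ_FILE_USN_DATA, NULL, 0,
                         buffer, sizeof(buffer), &bytes, NULL))
        return -1;
    return ((USN_RECORD *)buffer)->Usn;
}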

All of that said, my experience leads me to believe that this will be more complicated than my off-the-cuff explanation hints at. This might be a good starting place, though.

If the volumes are not block-level copies of one another, then it will be very difficult, if not impossible, to compare USN numbers and file IDs. Instead, you may very well be going by file name, which will be difficult if not impossible to do without opening every file (times can be modified by apps, sizes and times can be out of date in the FindFirstFile/FindNextFile queries, and you have to handle deleted-then-recreated cases, rename cases, etc.).

So knowing how much control you have over the environment is pretty important.

jrtipton
I'm going ahead loosely based on this approach. Thanks, mate!
lalli