Programs that index filesystems seem to know which parts have changed since their last index and only rescan those parts. How can I determine where the filesystem or its files have changed since my last index? I don't care what language you answer in, but I'm thinking of C on Windows.

An example of such a program is SequoiaView, which generates a treemap of your hard disk.

+4  A: 

A fairly simple method would be to take the file sizes, dates (as integer values), and file names that the file system reports for a given directory and calculate a checksum that you then associate with that directory. You would still need to perform this calculation on every directory using file system data, but you wouldn't have to go in depth (opening files to check for differences) unless a checksum reported a difference.

For tracking specific changes at the file level, you would store checksums based on individual file attributes and, of course, note the presence or absence of files and subdirectories since the last scan.

This wouldn't guarantee that no changes have occurred, since there are file system utilities that can alter all manner of attributes, but it would be a good first step for a basic scan.
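The file-level tracking described above could be sketched roughly as follows. This is a minimal illustration in Python rather than C, and the names `snapshot` and `diff` are made up for the example. It stores the raw attribute tuple per entry instead of a checksum of it, which is an assumption made to keep the sketch short; for subdirectories it records presence only.

```python
import os

def snapshot(directory):
    """Map each entry name in one directory to a small metadata signature."""
    result = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                # Only the presence of a subdirectory is tracked here.
                result[entry.name] = ("dir", 0, 0)
            else:
                info = entry.stat(follow_symlinks=False)
                result[entry.name] = ("file", info.st_size, info.st_mtime_ns)
    return result

def diff(old, new):
    """Return (added, removed, changed) entry names between two snapshots."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(n for n in old.keys() & new.keys() if old[n] != new[n])
    return added, removed, changed
```

A caller would keep the result of `snapshot()` from the previous scan (for example serialized to disk) and pass it to `diff()` together with a fresh snapshot.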

You may find the source code for fswatch helpful.

cfeduke
+1  A: 

FindFirstChangeNotification

Windows programmer
+4  A: 

If you are coding in a .NET managed language, try the FileSystemWatcher class.

From MSDN:

Use FileSystemWatcher to watch for changes in a specified directory. You can watch for changes in files and subdirectories of the specified directory. You can create a component to watch files on a local computer, a network drive, or a remote computer.

To watch for changes in all files, set the Filter property to an empty string ("") or use wildcards ("*.*"). To watch a specific file, set the Filter property to the file name. For example, to watch for changes in the file MyDoc.txt, set the Filter property to "MyDoc.txt". You can also watch for changes in a certain type of file. For example, to watch for changes in text files, set the Filter property to "*.txt".

Aydsman
+3  A: 

Look into directory change notifications.

Ferruccio
+2  A: 

You have two issues to deal with here.

The first is watching for dynamic changes (made while your program is running). In that case, you need to use the Windows API function ReadDirectoryChangesW. There are plenty of online examples showing how to use it. (Beware: some examples are not very good. This API call CAN AND WILL return more than one event per call, so you need to read the interface carefully, understand how it works, and process EVERYTHING that gets returned.)
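To illustrate the "more than one event per call" point: the buffer filled by ReadDirectoryChangesW holds a chain of FILE_NOTIFY_INFORMATION records, each starting with three 32-bit fields (NextEntryOffset, Action, FileNameLength in bytes) followed by a UTF-16 file name that is not null-terminated. The sketch below only walks such a buffer; it does not call the API itself, and the function name `parse_notifications` is an assumption for the example.

```python
import struct

# Action codes as defined in the Windows SDK headers.
ACTIONS = {
    1: "added",             # FILE_ACTION_ADDED
    2: "removed",           # FILE_ACTION_REMOVED
    3: "modified",          # FILE_ACTION_MODIFIED
    4: "renamed_old_name",  # FILE_ACTION_RENAMED_OLD_NAME
    5: "renamed_new_name",  # FILE_ACTION_RENAMED_NEW_NAME
}

HEADER_SIZE = 12  # three little-endian DWORDs precede the file name

def parse_notifications(buffer):
    """Return every (action, file name) pair found in a notification buffer."""
    events = []
    offset = 0
    while offset + HEADER_SIZE <= len(buffer):
        next_offset, action, name_bytes = struct.unpack_from("<III", buffer, offset)
        start = offset + HEADER_SIZE
        name = buffer[start:start + name_bytes].decode("utf-16-le")
        events.append((ACTIONS.get(action, action), name))
        if next_offset == 0:  # zero marks the last record in the chain
            break
        offset += next_offset  # offsets are relative to the current record
    return events
```

Code that reads only the first record and ignores NextEntryOffset would silently drop every later event in the same buffer, which is the mistake the warning above is about.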

The second issue is if you have a folder, or a list of folders, and you want to check whether the contents have changed, whether by files being added, deleted, or modified in that folder.

In this case, the most effective method is to read the folder contents one file name at a time and build a cumulative hash. More than that, though, you also want to get the attributes (using something like GetFileAttributesEx) and include those in the hash as well. (Make sure to exclude the folders "." and "..", or the results will be misleading.)

The reason for this is that you want to catch changes in a file by its size, dates, etc. You probably don't want to include the last-accessed time, though, since merely reading a file can update it.

Any good general-purpose hashing function should do. The result is a single big number (the hash) for each folder.

Then when you make another pass over, you re-compute the hash and compare with the stored hash for the last known state of that folder. If the hashes don't match, then you need to go poking through the folder in detail.
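The hash-and-compare pass described above might look something like this. It is a sketch in Python rather than C, using os.scandir in place of GetFileAttributesEx and SHA-256 as the hash; both choices, and the function names, are assumptions for illustration. os.scandir never yields "." or "..", so they need no special handling here.

```python
import hashlib
import os

def folder_hash(folder):
    """Hash the names, sizes and modification times of one folder's entries."""
    digest = hashlib.sha256()
    with os.scandir(folder) as entries:
        # Sort so the result does not depend on enumeration order.
        for entry in sorted(entries, key=lambda e: e.name):
            if entry.is_dir(follow_symlinks=False):
                # Subfolders get their own hash, so only the name counts here.
                record = f"d\0{entry.name}\n"
            else:
                info = entry.stat(follow_symlinks=False)
                # Last-access time is deliberately left out.
                record = f"f\0{entry.name}\0{info.st_size}\0{info.st_mtime_ns}\n"
            digest.update(record.encode("utf-8", "surrogateescape"))
    return digest.hexdigest()

def scan_tree(root):
    """Compute one hash per folder under root (root included)."""
    return {path: folder_hash(path) for path, _dirs, _files in os.walk(root)}

def folders_to_rescan(stored, current):
    """Folders whose hash differs, or that exist in only one of the scans."""
    return sorted(p for p in stored.keys() | current.keys()
                  if stored.get(p) != current.get(p))
```

The dictionary returned by `scan_tree()` is what would be stored between runs; only the folders reported by `folders_to_rescan()` need a detailed look.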

Effectively, this approach tells you (quickly) that there is something here you need to look at in more detail, and how you do that depends on what you are trying to achieve.

This has the advantage that you are not looking at the contents of each file in the folder, but only at some metadata that gives you enough of an indication. The processing is therefore typically far faster, often by orders of magnitude.

+1  A: 

Under Linux (and any other Unix-like OS I suppose) one could generate a hash value for a file/folder to represent its state at a given time. Later, just regenerate the hash and compare that with the old value. This proved to be very effective for some of the projects I was working on!

Details are here: http://valeriu.palos.ro/169/recursive-filedirectory-change-detection/

It is sensitive to basically any change (even when only changing the access time of a file).
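The linked script is not reproduced here. As a rough illustration of the access-time sensitivity, the following Python sketch (function and parameter names are assumptions) folds a whole tree into one hash and makes the access time optional, so the two behaviours can be compared.

```python
import hashlib
import os

def tree_hash(root, include_atime=False):
    """Fold the metadata of every file and folder under root into one hash."""
    digest = hashlib.sha256()
    for path, dirs, files in os.walk(root):
        dirs.sort()  # makes the walk order deterministic
        rel_dir = os.path.relpath(path, root)
        digest.update(repr(("dir", rel_dir)).encode("utf-8"))
        for name in sorted(files):
            info = os.lstat(os.path.join(path, name))
            fields = ["file", rel_dir, name, info.st_size, info.st_mtime_ns]
            if include_atime:
                # With this enabled, merely reading a file can change the hash.
                fields.append(info.st_atime_ns)
            digest.update(repr(fields).encode("utf-8"))
    return digest.hexdigest()
```

With `include_atime=True` the hash reacts to any access that updates the timestamp; with the default it changes only when names, sizes, or modification times do.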

Valeriu Paloş