I have a program that needs to retrieve some data about a set of files (that is, a directory and all files of certain types within it and its subdirectories). The data is (very) expensive to calculate, so rather than traversing the filesystem and calculating it on program startup, I keep a cache of the data in a SQLite database and use a FileSystemWatcher to monitor changes to the filesystem. This works great while the program is running, but the question is how to refresh/synchronize the data during program startup. If files have been added (or changed -- I presume I can detect this via last modified/size), the data needs to be recomputed in the cache, and if files have been removed, the data needs to be removed from the cache (since the interface traverses the cache instead of the filesystem).

So the question is: what's a good algorithm to do this? One way I can think of is to traverse the filesystem and gather the path and last modified/size of all files into a dictionary. Then I go through the entire list in the database. If there is no match, I delete the item from the database/cache. If there is a match, I delete the item from the dictionary. The dictionary then contains exactly the items whose data needs to be refreshed. This might work, but it seems it would be fairly memory-intensive and time-consuming to perform on every startup, so I was wondering if anyone had better ideas?

If it matters: the program is Windows-only, written in C# on .NET 3.5, using the System.Data.SQLite ADO.NET provider, accessed via the Entity Framework/LINQ to Entities.

+1  A: 

The first obvious thing that comes to mind is creating a separate small application that would always run (as a service, perhaps) and create a kind of "log" of changes in the file system (no need to work with SQLite, just write them to a file). Then, when the main application starts, it can look at the log and know exactly what has changed (don't forget to clear the log afterwards :-).
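A minimal sketch of that companion watcher, assuming a plain tab-separated log file (the paths, log format, and class name here are placeholders, not anything from the question):

```csharp
using System;
using System.IO;

static class ChangeLogger
{
    // Watch the tree and append one line per change to a plain log file
    // that the main application reads (and then clears) at startup.
    // The log file should live OUTSIDE the watched tree, or the watcher
    // will log its own writes.
    public static FileSystemWatcher Start(string root, string logPath)
    {
        var watcher = new FileSystemWatcher(root) { IncludeSubdirectories = true };
        FileSystemEventHandler log = (s, e) =>
            File.AppendAllText(logPath,
                e.ChangeType + "\t" + e.FullPath + Environment.NewLine);
        watcher.Changed += log;
        watcher.Created += log;
        watcher.Deleted += log;
        watcher.Renamed += (s, e) =>
            File.AppendAllText(logPath,
                "Renamed\t" + e.OldFullPath + " -> " + e.FullPath + Environment.NewLine);
        watcher.EnableRaisingEvents = true;
        return watcher;
    }
}
```

One caveat with this approach: FileSystemWatcher can drop events if its internal buffer overflows during bursts of changes, so the log is best treated as a hint rather than a guarantee.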

However, if that is unacceptable to you for some reason, let us try to look at the original problem.

First of all, you have to accept that in the worst-case scenario, when all the files have changed, you will need to traverse the whole tree. And that may (though not necessarily will) take a long time. Once you realize that, you have to think about doing the job in the background, without blocking the application.

Second, if you have to make a decision about each file that only you know how to make, there is probably no other way than going through all files.

Putting the above in other words, you might say that the problem is inherently complex (and any given problem cannot be solved with an algorithm that is simpler than the problem itself).

Therefore, your only hope is reducing the search space by using tweaks and hacks. And I have two of those on my mind.

First, it's better to query the database separately for every file instead of building a dictionary of all files first. If you create an index on the file path column in your database, it should be quicker, and of course, less memory-intensive.
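As a sketch of that per-file lookup (the table and column names here are assumptions, not the asker's actual schema; storing the timestamp as ticks sidesteps DateTime round-trip precision issues with SQLite's text storage):

```csharp
using System;
using System.Data.SQLite; // System.Data.SQLite ADO.NET provider

static class CacheLookup
{
    // Returns true if the cached entry matches the file's current
    // last-modified time and size, i.e. the cached data is still fresh.
    // Assumed schema: Files(Path TEXT PRIMARY KEY, MTimeTicks INTEGER, Size INTEGER)
    // -- Path being the primary key gives you the index for free.
    public static bool IsCacheFresh(SQLiteConnection conn,
                                    string path, DateTime lastModifiedUtc, long size)
    {
        using (var cmd = conn.CreateCommand())
        {
            cmd.CommandText = "SELECT MTimeTicks, Size FROM Files WHERE Path = @path";
            cmd.Parameters.AddWithValue("@path", path);
            using (var reader = cmd.ExecuteReader())
            {
                if (!reader.Read())
                    return false; // not cached at all
                return reader.GetInt64(0) == lastModifiedUtc.Ticks
                    && reader.GetInt64(1) == size;
            }
        }
    }
}
```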

Second, you don't actually have to query the database at all :-) Just store the exact time when your application was last running somewhere (in a .settings file?) and check every file to see if it's newer than that time. If it is, you know it's changed. If it's not, you know you caught its change last time (with your FileSystemWatcher).
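A rough sketch of that check (where the last-run time is persisted -- settings file, registry, a row in the database -- is left open, as above):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class StartupScan
{
    // Files written after lastRunUtc changed while we weren't watching;
    // everything else was already caught by the FileSystemWatcher during
    // the previous session. Note this does not detect deleted files.
    public static IEnumerable<string> FilesChangedSince(string root, DateTime lastRunUtc)
    {
        foreach (string path in Directory.GetFiles(root, "*", SearchOption.AllDirectories))
        {
            if (File.GetLastWriteTimeUtc(path) > lastRunUtc)
                yield return path;
        }
    }
}
```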

Hope this helps. Have fun.

Fyodor Soikin
Great answer! I thought about querying the database for each file (yes, I have an index), however that might leave orphaned entries in the database (entries for files that don't exist anymore). Since I query the database to respond to user queries, the only way to reconcile this would be to check the filesystem to make sure it's there when I'm displaying a result, a less-than-ideal solution.
Robert Fraser
Regarding deleted files: you could use the directory's last-modified time for this. Every time a file is created, deleted, or renamed in a directory, that directory is considered 'modified', and its last-modified time is updated. You can use this to identify directories that don't have any created or deleted files. For such directories, you go through the files and just check their last-modified dates. And for directories that do have some created/deleted files, you fall back to building the dictionary.
Fyodor Soikin
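The directory-pruning idea from the comment above could be sketched like this (the two delegates are placeholders for the 'check file timestamps only' and 'full reconciliation' strategies; note that NTFS updates a directory's timestamp for entry changes, not for edits to an existing file's contents, which is why the per-file check is still needed on the cheap path):

```csharp
using System;
using System.IO;

static class DirScrub
{
    // For directories whose own timestamp is unchanged since the last run,
    // no entries were created, deleted, or renamed there, so only per-file
    // 'last modified' checks are needed. For modified directories, fall
    // back to a full reconciliation (the dictionary approach from the
    // question). Each directory is classified independently; the walk
    // itself is always recursive.
    public static void Scrub(string dir, DateTime lastRunUtc,
                             Action<string> checkFileTimestampsOnly,
                             Action<string> fullReconcile)
    {
        if (Directory.GetLastWriteTimeUtc(dir) > lastRunUtc)
            fullReconcile(dir);            // entries added/removed/renamed here
        else
            checkFileTimestampsOnly(dir);  // same set of files; check mtimes only

        foreach (string sub in Directory.GetDirectories(dir))
            Scrub(sub, lastRunUtc, checkFileTimestampsOnly, fullReconcile);
    }
}
```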
If you think about it carefully and long enough, you can probably come up with some other strategies for reducing search space. Taking a walk in a park also helps a lot :-)
Fyodor Soikin
Ah; didn't think of that.
Robert Fraser
+1  A: 

Windows has a change journal mechanism, which does what you want: you subscribe to changes in some part of the filesystem and upon startup can read a list of changes which happened since last time you read them. See: http://msdn.microsoft.com/en-us/library/aa363798%28VS.85%29.aspx

EDIT: I think it requires rather high privileges, unfortunately

Robert Obryk
This looks like exactly what I want, thanks! I'll look into the privilege issue, as long as my users don't have to run as admin, all is good.
Robert Fraser
Sigh, indeed it does require admin. But cool trick nonetheless.
Robert Fraser
Change journals are not intended for general-purpose user-oriented applications. They are a very powerful, but also (as a flip side) potentially very destructive, technology. They are only meant for very special-purpose applications, like backup tools or maintaining the internal integrity of the file system itself.
Fyodor Soikin
+2  A: 

Our application is cross-platform C++ desktop application, but has very similar requirements. Here's a high-level description of what I did:

  • In our SQLite database there is a Files table that stores file_id, name, hash (currently we use last modified date as the hash value) and state.
  • Every other record refers back to a file_id. This makes it easy to remove "dirty" records when the file changes.

Our procedure for checking the filesystem and refreshing the cache is split into several distinct steps, to make things easier to test and to give us more flexibility as to when the caching occurs (Walker, Loader, etc. are just the class names I happened to pick):

On 1st Launch

  • The database is empty. The Walker recursively walks the filesystem and adds the entries into the Files table. The state is set to UNPARSED.
  • Next, the Loader iterates through the Files table looking for UNPARSED files. These are handed off to the Parser (which does the actual parsing and inserting of data).
  • This takes a while, so 1st launch can be a bit slow.
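A hypothetical translation of the first-launch Walker into C# with System.Data.SQLite (the answer's application is C++; the exact schema, state codes, and the use of last-modified ticks as the 'hash' are assumptions modeled on the description above):

```csharp
using System;
using System.Data.SQLite;
using System.IO;

static class Walker
{
    // Illustrative state codes for the Files table.
    public const int Unparsed = 0, Parsed = 1, Dirty = 2, Deleted = 3;

    // One-time setup: file_id, name, hash (here: last-modified ticks), state.
    public const string Schema =
        @"CREATE TABLE IF NOT EXISTS Files (
              file_id INTEGER PRIMARY KEY,
              name    TEXT NOT NULL UNIQUE,
              hash    INTEGER NOT NULL,
              state   INTEGER NOT NULL
          );";

    // First-launch walk: insert every file as UNPARSED; the Loader/Parser
    // later picks those rows up and does the expensive work.
    public static void Walk(SQLiteConnection conn, string root)
    {
        foreach (string path in Directory.GetFiles(root, "*", SearchOption.AllDirectories))
        {
            using (var cmd = conn.CreateCommand())
            {
                cmd.CommandText =
                    "INSERT OR IGNORE INTO Files (name, hash, state) " +
                    "VALUES (@name, @hash, @state)";
                cmd.Parameters.AddWithValue("@name", path);
                cmd.Parameters.AddWithValue("@hash", File.GetLastWriteTimeUtc(path).Ticks);
                cmd.Parameters.AddWithValue("@state", Unparsed);
                cmd.ExecuteNonQuery();
            }
        }
    }
}
```

Wrapping the whole walk in a single transaction would speed up the inserts considerably on a large tree.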

There's a big testability benefit here, because you can test the filesystem-walking code independently of the loading/parsing code. On subsequent launches the situation is a little more complicated:

n+1 Launch

  • The Scrubber iterates over the Files table and looks for files that have been deleted and files that have been modified. It sets the state to DIRTY if the file exists but has been modified or DELETED if the file no longer exists.
  • The Deleter (not the most original name) then iterates over the Files table looking for DIRTY and DELETED files. It deletes the related records (related via the file_id). Once the related records are removed, the original File record is either deleted (if DELETED) or set back to state=UNPARSED (if DIRTY).
  • The Walker then walks the filesystem to pick up new files.
  • Finally the Loader loads all UNPARSED files
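The Scrubber step might look roughly like this in C# (the schema, state codes, and ticks-as-hash choice are assumptions modeled on the answer's description; this step only marks rows, leaving the Deleter/Walker/Loader steps separate, as described):

```csharp
using System;
using System.Collections.Generic;
using System.Data.SQLite;
using System.IO;

static class Scrubber
{
    // Illustrative state codes, matching the assumed Files schema:
    // Files(file_id INTEGER PRIMARY KEY, name TEXT, hash INTEGER, state INTEGER)
    public const int Unparsed = 0, Parsed = 1, Dirty = 2, Deleted = 3;

    // n+1 launch, step 1: mark rows DIRTY (file changed since its recorded
    // hash -- here, last-modified ticks) or DELETED (file no longer exists).
    public static void Scrub(SQLiteConnection conn)
    {
        var updates = new Dictionary<long, int>(); // file_id -> new state
        using (var cmd = conn.CreateCommand())
        {
            cmd.CommandText = "SELECT file_id, name, hash FROM Files";
            using (var r = cmd.ExecuteReader())
            {
                while (r.Read())
                {
                    long id = r.GetInt64(0);
                    string name = r.GetString(1);
                    long hash = r.GetInt64(2);
                    if (!File.Exists(name))
                        updates[id] = Deleted;
                    else if (File.GetLastWriteTimeUtc(name).Ticks != hash)
                        updates[id] = Dirty;
                }
            }
        }
        // Apply the state changes after the reader is closed.
        foreach (var kv in updates)
        {
            using (var cmd = conn.CreateCommand())
            {
                cmd.CommandText = "UPDATE Files SET state = @s WHERE file_id = @id";
                cmd.Parameters.AddWithValue("@s", kv.Value);
                cmd.Parameters.AddWithValue("@id", kv.Key);
                cmd.ExecuteNonQuery();
            }
        }
    }
}
```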

Currently the "worst case scenario" (every file changes) is very rare - so we do this every time the application starts-up. But by splitting the process up unto these steps we could easily extend the implementation to:

  • The Scrubber/Deleter could be refactored to leave the dirty records in-place until after the new data is loaded (so the application "keeps working" while new data is cached into the database)
  • The Loader could load/parse on a background thread during an idle time in the main application
  • If you know something about the data files ahead of time you could assign a 'weight' to the files and load/parse the really-important files immediately and queue-up the less-important files for processing at a later time.

Just some thoughts / suggestions. Hope they help!

Kassini
Very detailed answer, great ideas! How is performance?
Robert Fraser