tags:

views:

39

answers:

2

I have a database where each entry has a file path and a last modified field:

1284581625555  C:\docs\text1.txt
1284581646992  C:\docs\text2.txt
1284581654886  C:\docs\text3.txt
1284581662927  C:\docs\subfolder\text4.txt
1284581671986  C:\docs\subfolder\text5.txt
...

Each entry also has a summary of the file contents, and the entries were created by recursively walking down a certain folder (in this case C:\docs) and adding all visited files. Now I'd like to update the database, i.e.

  • Add newly created files
  • Remove deleted files
  • Update modified files

Obviously, I have to walk down the root folder again to see what has changed. But what is the most efficient way to do so?

There are two approaches I can think of:

  • First traverse the database, remove all deleted entries and update all modified entries. For this, each time you have to create a file object from the the stored path string, and call file.exists() or file.isModified(). Then recursively walk down the root folder and add files which aren't in the database yet.
  • First walk down the file tree and remember in a list what has been added/deleted/modified --- this requires having stored a complete snapshot of the previous state of the file tree. Then traverse the database and add/delete/modify entries, based on the previously created list.

Which approach is better? Are there any other?

EDIT: Creating the summary is very expensive (full text extraction), and traversing the database is also somewhat expensive, since it is file-based.

+2  A: 

I would think that the easiest way to do this would be to delete and recreate the file. Depending on how difficult it is to create the "summary", this could well be the fastest method since you don't need to compare or edit anything.

If the summary creation is "hard" and the database fits in memory, the easiest way to go would probably be to load the database into a dict (keyed on the filename, with data indicating whether or not the file has been "seen") and do the os.walk again, updating the dict as necessary. Then iterate the dict, writing all entries that have been seen.

(BTW the last modified field isn't necessarily useful, you have to check the file's modified time anyway so might as well compare it to the database's timestamp.)

dash-tom-bang
+1  A: 

Probably the best way to handle this is to re-walk the tree again in its entirety. That way, instead of calling File.exist() all the time, you are only calling Directory.list() once per directory. This saves you on file IO calls, which is most likely the bottleneck in this situation.

Once you have a list of the currently-existing files, you can compare the two lists, and determine for each file:

  • New file
  • Deleted file
  • Altered file

and proceed accordingly.

lacqui