Attempt #2:
People don't seem to be understanding what I'm trying to do. Let me see if I can state it more clearly:
1) Reading a list of files is much faster than walking a directory.
2) So let's have a function that walks a directory and writes the resulting list to a file. Now, in the future, if we want to get all the files in that directory we can just read this file instead of walking the dir. I call this file the index.
3) Obviously, as the filesystem changes the index file gets out of sync. To overcome this, we have a separate program that hooks into the OS in order to monitor changes to the filesystem. It writes those changes to a file called the monitor log. Immediately after we read the index file for a particular directory, we use the monitor log to apply the various changes to the index so that it reflects the current state of the directory.
Because reading files is so much cheaper than walking a directory, this should be much faster than walking for all calls after the first.
Orignial post:
I want a function that will recursively get all the files in any given directory and filter them according to various parameters. And I want it to be fast -- like, an order of magnitude faster than simply walking the dir. And I'd prefer to do it in Python. Cross-platform is preferable, but Windows is most important.
Here's my idea for how to go about this:
I have a function called all_files:
def all_files(dir_path, ...parms...):
...
The first time I call this function it will use os.walk to build a list of all the files, along with info about the files such as whether they are hidden, a symbolic link, etc. I'll write this data to a file called ".index" in the directory. On subsequent calls to all_files, the .index file will be detected, and I will read that file rather than walking the dir.
This leaves the problem of the index getting out of sync as files are added and removed. For that I'll have a second program that runs on startup, detects all changes to the entire filesystem, and writes them to a file called "mod_log.txt". It detects changes via Windows signals, like the method described here. This file will contain one event per line, with each event consisting of the path effected, the type of event (create, delete, etc.), and a timestamp. The .index file will have a timestamp as well for the time it was last updated. After I read the .index file in all_files I will tail mod_log.txt and find any events that happened after the timestamp in the .index file. It will take these recent events, find any that apply to the current directory, and update the .index accordingly.
Finally, I'll take the list of all files, filter it according to various parameters, and return the result.
What do you think of my approach? Is there a better way to do this?
Edit:
Check this code out. I'm seeing a drastic speedup from reading a cached list over a recursive walk.
import os
from os.path import join, exists
import cProfile, pstats
dir_name = "temp_dir"
index_path = ".index"
def create_test_files():
os.mkdir(dir_name)
index_file = open(index_path, 'w')
for i in range(10):
print "creating dir: ", i
sub_dir = join(dir_name, str(i))
os.mkdir(sub_dir)
for i in range(100):
file_path = join(sub_dir, str(i))
open(file_path, 'w').close()
index_file.write(file_path + "\n")
index_file.close()
#
# 0.238 seconds
def test_walk():
for info in os.walk("temp_dir"):
pass
# 0.001 seconds
def test_read():
open(index_path).readlines()
if not exists("temp_dir"):
create_test_files()
def profile(s):
cProfile.run(s, 'profile_results.txt')
p = pstats.Stats('profile_results.txt')
p.strip_dirs().sort_stats('cumulative').print_stats(10)
profile("test_walk()")
profile("test_read()")