views:

350

answers:

6

Attempt #2:

People don't seem to be understanding what I'm trying to do. Let me see if I can state it more clearly:

1) Reading a list of files is much faster than walking a directory.

2) So let's have a function that walks a directory and writes the resulting list to a file. Now, in the future, if we want to get all the files in that directory we can just read this file instead of walking the dir. I call this file the index.

3) Obviously, as the filesystem changes the index file gets out of sync. To overcome this, we have a separate program that hooks into the OS in order to monitor changes to the filesystem. It writes those changes to a file called the monitor log. Immediately after we read the index file for a particular directory, we use the monitor log to apply the various changes to the index so that it reflects the current state of the directory.

Because reading files is so much cheaper than walking a directory, this should be much faster than walking for all calls after the first.

Original post:

I want a function that will recursively get all the files in any given directory and filter them according to various parameters. And I want it to be fast -- like, an order of magnitude faster than simply walking the dir. And I'd prefer to do it in Python. Cross-platform is preferable, but Windows is most important.

Here's my idea for how to go about this:

I have a function called all_files:

def all_files(dir_path, ...parms...):
    ...

The first time I call this function it will use os.walk to build a list of all the files, along with info about the files such as whether they are hidden, a symbolic link, etc. I'll write this data to a file called ".index" in the directory. On subsequent calls to all_files, the .index file will be detected, and I will read that file rather than walking the dir.
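A minimal sketch of that read-or-walk step, written for Python 3. The one-path-per-line index format and the helper name walk_files are assumptions for illustration; the per-file info (hidden, symlink, etc.) is left out:

```python
import os

INDEX_NAME = ".index"  # name taken from the post; the one-path-per-line format is an assumption


def walk_files(dir_path):
    """Return every file path under dir_path, relative to dir_path."""
    found = []
    for root, _dirs, files in os.walk(dir_path):
        for name in files:
            if root == dir_path and name == INDEX_NAME:
                continue  # do not list the index file itself
            found.append(os.path.relpath(os.path.join(root, name), dir_path))
    return sorted(found)


def all_files(dir_path):
    """Read the cached list if it exists; otherwise walk once and cache the result."""
    index_path = os.path.join(dir_path, INDEX_NAME)
    if os.path.exists(index_path):
        with open(index_path) as index_file:
            return index_file.read().splitlines()
    paths = walk_files(dir_path)
    with open(index_path, "w") as index_file:
        index_file.write("".join(p + "\n" for p in paths))
    return paths
```

The second and later calls never touch os.walk, so the result is only as fresh as the index file.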

This leaves the problem of the index getting out of sync as files are added and removed. For that I'll have a second program that runs on startup, detects all changes to the entire filesystem, and writes them to a file called "mod_log.txt". It detects changes via Windows signals, like the method described here. This file will contain one event per line, with each event consisting of the path affected, the type of event (create, delete, etc.), and a timestamp. The .index file will have a timestamp as well for the time it was last updated. After I read the .index file in all_files I will tail mod_log.txt and find any events that happened after the timestamp in the .index file. I will take these recent events, find any that apply to the current directory, and update the .index accordingly.
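The update step could be sketched roughly as follows. The tab-separated "timestamp, event type, path" line layout, the event names "create" and "delete", and the helper names are all assumptions; the post only says each line carries a path, an event type, and a timestamp. For simplicity the sketch scans every log line instead of tailing the file:

```python
def parse_event(line):
    """Split one log line into (timestamp, kind, path). The layout is an assumption."""
    timestamp, kind, path = line.rstrip("\n").split("\t", 2)
    return float(timestamp), kind, path


def apply_events(index_paths, index_time, log_lines, dir_prefix):
    """Return the index updated with events newer than index_time under dir_prefix.

    dir_prefix should end with a path separator so that "proj/" does not
    also match "proj2/".
    """
    current = set(index_paths)
    for line in log_lines:
        timestamp, kind, path = parse_event(line)
        if timestamp <= index_time or not path.startswith(dir_prefix):
            continue  # already reflected in the index, or outside this directory
        if kind == "create":
            current.add(path)
        elif kind == "delete":
            current.discard(path)
    return sorted(current)


index = ["proj/a.txt", "proj/b.txt"]
log = [
    "50.0\tcreate\tproj/old.txt",   # older than the index, ignored
    "150.0\tcreate\tproj/c.txt",
    "160.0\tdelete\tproj/a.txt",
    "170.0\tcreate\tother/x.txt",   # different directory, ignored
]
updated = apply_events(index, 100.0, log, "proj/")
```

Here updated ends up holding proj/b.txt and proj/c.txt.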

Finally, I'll take the list of all files, filter it according to various parameters, and return the result.

What do you think of my approach? Is there a better way to do this?

Edit:

Check this code out. I'm seeing a drastic speedup from reading a cached list over a recursive walk.

import os
from os.path import join, exists
import cProfile, pstats

dir_name = "temp_dir"
index_path = ".index"

def create_test_files():
    # Build 10 sub-directories of 100 empty files and record each path in the index.
    os.mkdir(dir_name)
    index_file = open(index_path, 'w')
    for i in range(10):
        print("creating dir: %d" % i)
        sub_dir = join(dir_name, str(i))
        os.mkdir(sub_dir)
        for j in range(100):
            file_path = join(sub_dir, str(j))
            open(file_path, 'w').close()
            index_file.write(file_path + "\n")
    index_file.close()

#  0.238 seconds
def test_walk():
    for info in os.walk(dir_name):
        pass

#  0.001 seconds
def test_read():
    with open(index_path) as index_file:
        index_file.readlines()

if not exists(dir_name):
    create_test_files()

def profile(s):
    cProfile.run(s, 'profile_results.txt')
    p = pstats.Stats('profile_results.txt')
    p.strip_dirs().sort_stats('cumulative').print_stats(10)

profile("test_walk()")
profile("test_read()")
+7  A: 

Do not try to duplicate the work that the filesystem already does. You are not going to do better than it already does.

Your scheme is flawed in many ways and it will not get you an order-of-magnitude improvement.

Flaws and potential problems:

You are always going to be working with a snapshot of the file system. You will never know with any certainty that it is not significantly disjoint from reality. If that is within the working parameters of your application, no sweat.

The filesystem monitor program still has to recursively walk the file system, so the work is still being done.

In order to increase the accuracy of the cache, you have to increase the frequency with which the filesystem monitor runs. The more it runs, the less actual time that you are saving.

Your client application likely won't be able to read the index file while it is being updated by the filesystem monitor program, so you'll lose time while the client waits for the index to be readable.

I could go on.

If, in fact, you don't care about working with a snapshot of the filesystem that may be very disjoint from reality, I think that you'd be much better off keeping the index in memory and updating it from within the application itself. That will scrub any file contention issues that will otherwise arise.
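A minimal sketch of what such an in-memory index might look like, assuming Python 3 and assuming the application is the only thing changing the tree. The class and method names are hypothetical:

```python
import os


class InMemoryIndex:
    """Hypothetical helper: holds the file list for one directory tree in memory."""

    def __init__(self, root):
        self.root = root
        self.paths = set()
        self.refresh()

    def refresh(self):
        """Rebuild the whole index with a full walk (the expensive step)."""
        self.paths = {
            os.path.join(dirpath, name)
            for dirpath, _dirnames, filenames in os.walk(self.root)
            for name in filenames
        }

    def record_created(self, path):
        """Call after the application itself creates a file."""
        self.paths.add(path)

    def record_deleted(self, path):
        """Call after the application itself deletes a file."""
        self.paths.discard(path)

    def files(self):
        """Return the current snapshot as a sorted list."""
        return sorted(self.paths)
```

Changes made by other programs stay invisible until refresh() is called again, which is the snapshot caveat described above.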

Adam Crossland
That's a good point. The FS is already doing this more or less.
SoloBold
It means you can't build a universal solution. But if you know something your FS doesn't know (like the files only get updated at 5pm daily), you can use this knowledge to cache the information you need from the FS.
Antony Hatchkins
Look at the code sample I added above. There is a clear, drastic improvement from reading a list of files over walking a dir. Please elaborate on the flaws you see.
Jesse Aldridge
The flaw is that your profiling is completely dishonest. (Though, not intentionally.) You compared an os.walk to reading a file. However, the file was created by an equivalent to os.walk. Your performance is going to be (a) os.walk or (b) os.walk to create an index + read the index. When you profiled it, you didn't count all of the work in (b). Your testing setup did the vast majority of the work of (b), and then you only profiled the last tiny step.
Travis Bradshaw
Jesse, my flaws are going to be listed in my answer to your question.
Adam Crossland
@Travis Yes, but I only need to create the index file *the first time*. On subsequent calls I can read the index file and avoid the walk. That's what caching is all about.
Jesse Aldridge
Jesse, caching is only as good as the probability that the information that is cached is still accurate and useful.
Adam Crossland
@Adam Thanks for the elaboration, but... 1) Seeing as how I'll be updating the index immediately before I return the list of files, the risk of being out of sync seems about the same as using os.walk. 2) No it doesn't. The *indexer* recursively walks the first time. The *monitor* is hooked up to Windows signals. 3) No I don't. Again, Windows signals. I should have mentioned that in my question, sorry. 4) The index is updated by the all_files function just before returning. There will be some slowdown, but I suspect it will still be significantly faster than walking the dir.
Jesse Aldridge
Jesse -- good luck.
Adam Crossland
Heh. Thanks for the effort.
Jesse Aldridge
A: 

I would like to recommend you just use a combination of os.walk (to get directory trees) & os.stat (to get file information) for this. Using the std-lib will ensure it works on all platforms, and they do the job nicely. And no need to index anything.
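A rough sketch of that combination, assuming Python 3. The function name and the min_size and follow_links parameters are illustrative only, standing in for whatever filters you need:

```python
import os
import stat


def iter_files(dir_path, min_size=0, follow_links=False):
    """Yield (path, stat_result) for regular files of at least min_size bytes."""
    for dirpath, _dirnames, filenames in os.walk(dir_path, followlinks=follow_links):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat reports on a symlink itself instead of its target
                info = os.stat(path) if follow_links else os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(info.st_mode) and info.st_size >= min_size:
                yield path, info
```

Because it is a generator, callers can stop early without paying for the rest of the walk.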

As others have stated, I don't really think you're going to buy much by attempting to index and re-index the filesystem, especially if you're already limiting your functionality by path and parameters.

jathanism
Yes, I'm already using walk and stat. But my function is slow and I think this could make it significantly faster.
Jesse Aldridge
Ah, ok then. You might want to consider one of the awesome search apps out there that operate in a Django-esque ORM style. There are a few listed here, the most popular of which seems to be Whoosh: http://haystacksearch.org/docs/installing_search_engines.html
jathanism
I've actually used Whoosh and SOLR. I think they are more suited to full text search than retrieving all files and filtering on attributes. I don't think something like that would work well for this case.
Jesse Aldridge
Ahh, that's a bummer. Well, sorry I couldn't help, I was thinking that indexing features would be useful.
jathanism
+2  A: 

Doesn't Windows Desktop Search provide such an index as a byproduct? On the mac the spotlight index can be queried for filenames like this: mdfind -onlyin . -name '*'.

Of course it's much faster than walking the directory.
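From Python, the mdfind query above could be wrapped along these lines. This is only a sketch: it works on macOS alone, the function names are made up for illustration, and it assumes Spotlight has indexed the directory in question:

```python
import subprocess


def spotlight_command(directory, name_pattern="*"):
    """Build the mdfind invocation quoted above (macOS only)."""
    return ["mdfind", "-onlyin", directory, "-name", name_pattern]


def spotlight_files(directory, name_pattern="*"):
    """Run mdfind and return one path per output line."""
    output = subprocess.check_output(spotlight_command(directory, name_pattern))
    return output.decode("utf-8").splitlines()
```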

Till Backhaus
Thank you for apparently being the only person on StackO to understand that. I hadn't thought of looking at Windows Search. It does indeed have indexing options. But something tells me trying to integrate that indexing with my function would be more trouble than it's worth...
Jesse Aldridge
The hard part is indeed to keep the index in sync. I'd assume that you are better off if you use the index that is already there.
Till Backhaus
A: 

I'm new to Python, but a combination of list comprehensions, an iterator, and a generator should scream, according to reports I've read.

import os
import re

class DirectoryIterator:
    def __init__(self, start_dir, pattern):
        self.directory = start_dir
        self.pattern = re.compile(pattern)

    def __iter__(self):
        # os.walk already recurses, so no nested iterators are needed
        for dirpath, dirnames, filenames in os.walk(self.directory):
            for name in filenames:
                if self.pattern.search(name):
                    yield os.path.join(dirpath, name)

###########

for file_name in DirectoryIterator(".", r"\.py$"):
    print(file_name)
null
Ah, but the bottleneck is os.walk, and your example still needs to call that.
Jesse Aldridge
A: 

The short answer is "no". You will not be able to build an indexing system in Python that will outpace the file system by an order of magnitude.

"Indexing" a filesystem is an intensive/slow task, regardless of the caching implementation. The only realistic way to avoid the huge overhead of building filesystem indexes is to "index as you go" to avoid the big traversal. (After all, the filesystem itself is already a data indexer.)

There are operating system features that are capable of doing this "build as you go" filesystem indexing. It's the very foundation of services like Spotlight on OSX and Windows Desktop Search.

To have any hope of getting faster speeds than walking the directories, you'll want to leverage one of those OS or filesystem level tools.

Also, try not to mislead yourself into thinking solutions are faster just because you've "moved" the work to a different time/process. Your example code does exactly that. You build the index while you're creating the sample files, which does the traversal's work up front, and then later you just read that file.

There are two lessons here. (a) To create a proper test it's essential to separate the "setup" from the "test". Here your performance test essentially says, "Which is faster, traversing a directory structure or reading an index that's already been created in advance?" Clearly this is not an apples-to-apples comparison.

However, (b) you've stumbled on the correct answer at the same time. You can get a list of files much faster if you use an already existing index. This is where you'd need to leverage something like the Windows Desktop Search or Spotlight indexes.
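To illustrate lesson (a), here is one way the test could be restructured so that setup creates only the files and the cost of building the index is timed on its own. This is a sketch for Python 3; every function name here is hypothetical, and the timings will vary with the machine and the state of the OS cache:

```python
import os
import tempfile
import time


def make_tree(root, dirs=10, files_per_dir=100):
    """Setup only: create empty files, without writing any index."""
    for d in range(dirs):
        sub_dir = os.path.join(root, str(d))
        os.mkdir(sub_dir)
        for f in range(files_per_dir):
            open(os.path.join(sub_dir, str(f)), "w").close()


def walk(root):
    """List every file under root with os.walk."""
    return sorted(
        os.path.join(dirpath, name)
        for dirpath, _dirs, names in os.walk(root)
        for name in names
    )


def build_index(root, index_path):
    """The index comes from a walk, so this cost belongs in the measurement."""
    with open(index_path, "w") as index_file:
        index_file.write("".join(p + "\n" for p in walk(root)))


def read_index(index_path):
    with open(index_path) as index_file:
        return index_file.read().splitlines()


def timed(func, *args):
    """Return (result, elapsed seconds) for one call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def run_benchmark():
    with tempfile.TemporaryDirectory() as root, tempfile.TemporaryDirectory() as index_dir:
        index_path = os.path.join(index_dir, "index.txt")
        make_tree(root)
        walked, t_walk = timed(walk, root)
        _, t_build = timed(build_index, root, index_path)
        cached, t_read = timed(read_index, index_path)
    return walked, cached, {"walk": t_walk, "build": t_build, "read": t_read}
```

The fair comparison for a first call is "walk" against "build" plus "read"; "read" alone only describes later calls, and only while the index is still accurate.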

Make no mistake, in order to build an index of a filesystem you must, by definition, "visit" every file. If your files are stored in a tree, then a recursive traversal is likely going to be the fastest way you can visit every file. If the question is "can I write Python code to do exactly what os.walk does but be an order of magnitude faster than os.walk" the answer is a resounding no. If the question is "can I write Python code to index every file on the system without taking the time to actually visit every file" then the answer is still no.

(Edit in response to "I don't think you understand what I'm trying to do")

Let's be clear here, virtually everyone here understands what you're trying to do. It seems that you're taking "no, this isn't going to work like you want it to work" to mean that we don't understand.

Let's look at this from another angle. File systems have been an essential component to modern computing from the very beginning. The categorization, indexing, storage, and retrieval of data is a serious part of computer science and computer engineering and many of the most brilliant minds in computer science are working on it constantly.

You want to be able to filter/select files based on attributes/metadata/data of the files. This is an extremely common task utilized constantly in computing. It's likely happening several times a second even on the computer you're working with right now.

If it were that simple to speed up this process by an order of magnitude(!), just by keeping a text file index of the filenames and attributes, don't you think every single file system and operating system in existence would do exactly that?

That said, of course caching the results of your specific queries could net you some small performance increases. And, as expected, file system and disk caching is a fundamental part of every modern operating system and file system.

But your question, as you asked it, has a clear answer: No. In the general case, you're not going to get an order of magnitude faster by reimplementing os.walk. You may be able to get a better amortized runtime by caching, but you're not going to beat it by an order of magnitude if you properly include the work to build the cache in your profiling.

Travis Bradshaw
Leveraging the Windows Desktop Search indexing is a nice idea. But I have no idea how to do that. Also, my method really isn't all that complicated or hard to implement. // I think you're misunderstanding what I'm trying to do. I've restated my question in an attempt to be more clear. The thing is I only need to write the index *the first time* I call the function and *subsequent calls* are sped up because I can just read the index and no longer need to walk. I keep the index up to date by applying deltas from the monitor on subsequent calls.
Jesse Aldridge
+1  A: 

The best answer came from Michał Marczyk toward the bottom of the comment list on the initial question. He pointed out that what I'm describing is very close to the UNIX locate program. I found a Windows version here: http://locate32.net/index.php. It solved my problem.

Edit: Actually the Everything search engine looks even better. Apparently Windows keeps journals of changes to the filesystem, and Everything uses that to keep the database up to date.

Jesse Aldridge