I'm looking for a high-performance method or library for scanning all files on disk or in a given directory and grabbing their basic stats: filename, size, and modification date.

I've written a python program that uses os.walk along with os.path.getsize to get the file list, and it works fine, but is not particularly fast. I noticed one of the freeware programs I had downloaded accomplished the same scan much faster than my program.

Any ideas for speeding up the file scan? Here's my Python code, but keep in mind that I'm not at all married to os.walk and I'm perfectly willing to use other APIs (including native Windows APIs) if there are better alternatives.

for root, dirs, files in os.walk(top, topdown=False):
    for name in files:
        ...
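
Fleshed out, the scan is essentially this (a simplified sketch; os.path.getmtime is my stand-in for however the modification date actually gets read):

import os

def scan(top):
    for root, dirs, files in os.walk(top, topdown=False):
        for name in files:
            full = os.path.join(root, name)
            # getsize and getmtime each do a stat call on top of the walk itself
            yield name, os.path.getsize(full), os.path.getmtime(full)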

I should also note that I realize the Python code probably can't be sped up that much; I'm particularly interested in any native APIs that provide better speed.

A: 

The os.path module has a directory tree walking function as well. I've never run any sort of benchmarks on it, but you could give it a try. I'm not sure there's a faster way than os.walk/os.path.walk in Python, however.

sli
os.path.walk() is rather clumsy and is being removed in Python 3, so please don't use it.
Benjamin Peterson
+3  A: 

Well, I would expect this to be a heavily I/O-bound task. As such, optimizations on the Python side would be quite ineffective; the only optimization I can think of is some different way of accessing/listing files, in order to reduce the actual reads from the file system. That of course requires deep knowledge of the file system, which I do not have, and which I would not expect Python's developers to have had while implementing os.walk.

What about spawning a command prompt, then issuing 'dir' and parsing the results? It might be a bit of overkill, but with any luck 'dir' is already making some effort at exactly this kind of optimization.
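
A rough sketch of that idea (assuming cmd.exe's dir: /s recurses and /-c drops the thousands separators from sizes; the output format is locale-dependent, so the parsing step is only indicative):

import subprocess

def scan_with_dir(top):
    # 'dir' is a cmd.exe builtin, so it has to run through the shell
    proc = subprocess.Popen('dir /s /-c "%s"' % top, shell=True,
                            stdout=subprocess.PIPE)
    output, _ = proc.communicate()
    for line in output.splitlines():
        # file lines carry date, time, size, and name; pull the three
        # stats out of each line here
        pass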

Roberto Liffredo
+3  A: 

It seems as if os.walk was considerably improved in Python 2.5, so you might check whether you're running that version.

Other than that, someone has already compared the speed of os.walk to ls and noticed a clear advantage for the latter, but not by a margin that would actually justify using it.

Ole
Thanks Ole, I am using Python 2.5.
Parand
+1  A: 

When you look at the code for os.walk, you'll see that there's not much fat to be trimmed.

For example, the following is only a hair faster than os.walk.

import os
import stat

# bind the lookups to local names to shave attribute-access overhead
listdir = os.listdir
pathjoin = os.path.join
getstat = os.stat
is_dir = stat.S_ISDIR
is_reg = stat.S_ISREG

def yieldFiles(path):
    for f in listdir(path):
        nm = pathjoin(path, f)
        s = getstat(nm).st_mode
        if is_dir(s):
            for sub in yieldFiles(nm):
                yield sub
        elif is_reg(s):
            yield nm  # yield the full path, not just the bare filename
        else:
            pass  # ignore anything that is neither a regular file nor a directory

Consequently, the overhead must be in the os module itself. You'll have to resort to making direct Windows API calls.

Look at the Python for Windows Extensions.
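
For instance, pywin32's win32file.FindFilesW wraps FindFirstFileW/FindNextFileW, and every directory entry it returns already carries the attributes, size, and timestamps, so no separate stat call is needed per file. A sketch (the WIN32_FIND_DATA tuple layout below is my reading of the pywin32 docs, so verify the index positions):

import win32file, win32con

def find_files(top):
    # each entry is a WIN32_FIND_DATA tuple: attributes, creation/access/write
    # times, size high and low words, two reserved fields, name, short name
    for info in win32file.FindFilesW(top + '\\*'):
        name = info[8]
        if name in ('.', '..'):
            continue
        path = top + '\\' + name
        if info[0] & win32con.FILE_ATTRIBUTE_DIRECTORY:
            for sub in find_files(path):
                yield sub
        else:
            size = (info[4] << 32) + info[5]  # nFileSizeHigh / nFileSizeLow
            yield path, size, info[3]         # info[3] is the last-write time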

S.Lott
+2  A: 

You might want to look at the code for some Python version control systems like Mercurial or Bazaar. They have devoted a lot of time to coming up with ways to quickly traverse a directory tree and detect changes (or "finding basic stats about the files").

Benjamin Peterson
+1  A: 

I'm wondering if you might want to group your I/O operations.

For instance, if you're walking a dense directory tree with thousands of files, you might try experimenting with walking the entire tree and storing all the file locations, and then looping through the (in-memory) locations and getting file statistics.

If your OS stores these two kinds of data in different locations (directory structure in one place, file stats in another), then this might be a significant optimization.

Anyway, that's something I'd try before digging further.
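
In code, the two-pass idea might look roughly like this (a minimal sketch):

import os

def two_pass_scan(top):
    # pass 1: walk the directory structure only, collecting full paths
    paths = []
    for root, dirs, files in os.walk(top):
        for name in files:
            paths.append(os.path.join(root, name))
    # pass 2: stat the collected paths in one tight loop
    results = []
    for p in paths:
        st = os.stat(p)
        results.append((p, st.st_size, st.st_mtime))
    return results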

Triptych
A: 

This is only partial help, more like pointers; however:

I believe you need to do the following:

fp = open("C:/$MFT", "rb")

using an account that has SYSTEM permissions, because even as an admin you can't open the "Master File Table" (kind of an inode table) of an NTFS filesystem. Once you succeed in that, you'll just have to locate information on the web that explains the structure of each file record (I believe it's commonly 1024 bytes per on-disk file, which includes the file's primary pathname), and off you go for super-high speeds of disk structure reading.

ΤΖΩΤΖΙΟΥ
The "for windows" part of the post title might've been relevant ;-)Messing with the MFT sounds a bit hairy, but I'll do a little digging to see if the format is easy to play with.
Parand
Oh, you're absolutely right and I'm absolutely blind! :)
ΤΖΩΤΖΙΟΥ