ansaurus

Question

fast folder size calculation in Python on Windows

Answer 1

+1 A:

You don't need to use a recursive algorithm if you use os.walk. Please check this question.

You should time both approaches, but this is supposed to be much faster:

import os

def get_dir_size(root):
    size = 0
    for path, dirs, files in os.walk(root):
        for f in files:
            size += os.path.getsize(os.path.join(path, f))
    return size

jbochi 2009-12-31 21:54:28
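For anyone who wants to time both approaches as suggested above, here is a minimal sketch. The helper name time_call is hypothetical, and time.perf_counter needs Python 3.3 or later, which is newer than this thread. Repeating the run matters because the first (cold) pass is usually much slower than later passes that hit the filesystem cache.

```python
import os
import time


def get_dir_size(root):
    """Sum the sizes of all files below root using os.walk."""
    size = 0
    for path, dirs, files in os.walk(root):
        for f in files:
            size += os.path.getsize(os.path.join(path, f))
    return size


def time_call(func, *args, repeat=3):
    """Hypothetical helper: call func(*args) `repeat` times.

    Returns the last result and the list of elapsed times in seconds.
    """
    timings = []
    result = None
    for _ in range(repeat):
        start = time.perf_counter()
        result = func(*args)
        timings.append(time.perf_counter() - start)
    return result, timings
```

Calling time_call(get_dir_size, some_folder) returns the total size plus one timing per run, so the cold and warm passes can be compared directly. The same helper can wrap any other implementation for a like-for-like comparison.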

The approach you are proposing takes 139 seconds instead of 72 seconds. We were using that one before and it is much slower.

Laurent Luce 2009-12-31 22:47:03

So, you already got a speedup of almost 100%, and you're still not satisfied? Greedy bastard! ;-) ;-)

jae 2009-12-31 23:15:37

Is it possible that the approach I proposed runs slower because it goes through all the files, while yours skips system dirs and the folders listed in DIR_EXCLUDES?

jbochi 2009-12-31 23:24:55
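If excluded directories do explain the difference, os.walk can skip them as well: in the default top-down mode, removing names from the dirs list in place stops os.walk from descending into them. A minimal sketch, where the DIR_EXCLUDES contents are placeholders because the asker's actual list is not shown in this thread:

```python
import os

# Placeholder names; the asker's actual DIR_EXCLUDES list is not shown here.
DIR_EXCLUDES = {"System Volume Information", "$Recycle.Bin"}


def get_dir_size(root, excludes=DIR_EXCLUDES):
    """Sum file sizes below root, skipping directories named in excludes."""
    size = 0
    for path, dirs, files in os.walk(root):
        # In-place slice assignment prunes the walk (top-down mode only).
        dirs[:] = [d for d in dirs if d not in excludes]
        for f in files:
            try:
                size += os.path.getsize(os.path.join(path, f))
            except OSError:
                # File vanished or access was denied; skip it.
                pass
    return size
```

With the same exclusions applied on both sides, the timing comparison between os.walk and the FindFilesW version should be fairer.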

Instead of using for loops, you could try map and reduce. For thousands of files the performance benefit could be significant.

jbochi 2009-12-31 23:42:11

@jbochi, 90%+ of the time consumed is in accessing the filesystem, so it's unlikely much improvement could be seen using things like map().

Peter Hansen 2010-01-01 02:02:32

Answer 2

+1 A:

I don't have a Windows box to test on at the moment, but the documentation states that win32file.FindFilesIterator is "similar to win32file.FindFiles, but avoid the creation of the list for huge directories". Does that help?

ephemient 2009-12-31 23:18:17

It didn't here. Slightly slower, in fact, which perhaps isn't too surprising considering that the list is built in C code and there should be less overhead scanning a list than using an iterator.

Peter Hansen 2010-01-01 02:06:06

Answer 3

+1 A:

It's a whopper of a directory tree. As others have said, I'm not sure you can speed it up... not like that, cold w/o data. And that means...

If you can cache data, somehow (not sure what the actual implication is), then you could speed things up (I think... as always, measure, measure, measure).

I don't think I have to tell you how to do caching, I guess, you seem like a knowledgeable person. And I wouldn't know off the cuff for Windows anyway. ;-)

jae 2009-12-31 23:21:18

The approach we are going to take is to compute the folder sizes in the background in our app, so they are ready when users ask for them.

Laurent Luce 2009-12-31 23:38:26
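A minimal sketch of that background approach, assuming a plain worker thread and an in-memory dictionary. The class name FolderSizeCache and its methods are hypothetical, and the sketch uses os.walk for portability rather than the win32file calls discussed in this thread.

```python
import os
import threading


def get_dir_size(root):
    """Sum file sizes below root, ignoring unreadable entries."""
    size = 0
    for path, dirs, files in os.walk(root):
        for f in files:
            try:
                size += os.path.getsize(os.path.join(path, f))
            except OSError:
                pass  # unreadable or deleted while scanning
    return size


class FolderSizeCache:
    """Hypothetical cache: computes folder sizes on a worker thread."""

    def __init__(self):
        self._sizes = {}
        self._lock = threading.Lock()

    def start(self, folders):
        """Begin computing sizes for folders; returns the worker thread."""
        worker = threading.Thread(target=self._run, args=(list(folders),))
        worker.daemon = True  # do not keep the app alive just for scanning
        worker.start()
        return worker

    def _run(self, folders):
        for folder in folders:
            size = get_dir_size(folder)
            with self._lock:
                self._sizes[folder] = size

    def get(self, folder):
        """Return the cached size, or None if it is not ready yet."""
        with self._lock:
            return self._sizes.get(folder)
```

The UI can call get() at any time and show a placeholder while the result is still None. Keeping the cache fresh after the first scan is a separate problem that this sketch does not address.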

Answer 4

+1 A:

This jumps out at me:

try:
  items = win32file.FindFilesW(path + '\\*')
except Exception as err:
  return 0

Exception handling can add significant time to your algorithm. If you can specify the path differently, in a way that you always know is safe, and thus prevent the need to capture exceptions (eg, checking first to see if the given path is a folder before finding files in that folder), you may find a significant speedup.

Robert P 2009-12-31 23:46:43

Actually, try/except blocks in Python (contrary to experience from some other languages) are very cheap when exceptions are not raised, and in any case that code is there to catch problems that could not be determined in advance (such as "access denied" on certain items) so it can't really be avoided.

Peter Hansen 2010-01-01 01:43:17
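One way to check this claim on a given interpreter is a micro-benchmark with timeit. The sketch below compares a loop with and without a try/except block when no exception is raised. The function names are illustrative only, and absolute numbers will vary by machine and Python version.

```python
import timeit


def add_plain(values):
    """Sum values with no exception handling."""
    total = 0
    for v in values:
        total += v
    return total


def add_guarded(values):
    """Same sum, but each addition sits inside a try block."""
    total = 0
    for v in values:
        try:
            total += v
        except TypeError:
            # Never reached for integer input; present only to measure
            # the cost of entering a try block.
            pass
    return total


def compare(n=1000, number=200):
    """Return (plain_seconds, guarded_seconds) for `number` runs each."""
    values = list(range(n))
    plain = timeit.timeit(lambda: add_plain(values), number=number)
    guarded = timeit.timeit(lambda: add_guarded(values), number=number)
    return plain, guarded
```

Whatever difference compare() reports should be weighed against the cost of a single filesystem call, which is the relevant comparison for the folder-size code.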

Answer 5

+4 A:

A quick profiling of your code suggests that over 90% of the time is consumed in the FindFilesW() call alone. This means any improvements by tweaking the Python code would be minor.
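A profile like this can be gathered with the standard cProfile and pstats modules. The sketch below uses an os.walk-based function as a stand-in, since the asker's FindFilesW code is not reproduced in full here, and profile_call is a hypothetical helper name.

```python
import cProfile
import io
import os
import pstats


def get_dir_size(root):
    """Stand-in for the function being profiled."""
    size = 0
    for path, dirs, files in os.walk(root):
        for f in files:
            size += os.path.getsize(os.path.join(path, f))
    return size


def profile_call(func, *args, top=10):
    """Run func(*args) under cProfile.

    Returns (result, report), where report lists the `top` entries
    sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()
```

Printing the report shows which calls dominate the cumulative time, which is how one would confirm whether the filesystem calls or the Python loop are the bottleneck.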

Tiny tweaks (if you were to stick with FindFilesW) could include ensuring DIR_EXCLUDES is a set instead of a list, avoiding the repeated lookups on other modules, and indexing into item[] lazily, as well as moving the sys.platform check outside. This incorporates these changes and others, but it won't give more than a 1-2% speedup.

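import pywintypes
import win32con
import win32file

DIR_EXCLUDES = set(['.', '..'])
# Matches a directory that is not also flagged as a system item.
MASK = win32con.FILE_ATTRIBUTE_DIRECTORY | win32con.FILE_ATTRIBUTE_SYSTEM
REQUIRED = win32con.FILE_ATTRIBUTE_DIRECTORY
# Bind once to avoid repeated module attribute lookups.
FindFilesW = win32file.FindFilesW

def get_dir_size(path):
    total_size = 0
    try:
        items = FindFilesW(path + r'\*')
    except pywintypes.error:
        # For example, "access denied" on this folder.
        return total_size

    for item in items:
        # item[4] and item[5] are the high and low 32 bits of the file size;
        # using item[5] alone undercounts files of 4 GiB or more.
        total_size += (item[4] << 32) + item[5]
        if (item[0] & MASK == REQUIRED):
            name = item[8]
            if name not in DIR_EXCLUDES:
                total_size += get_dir_size(path + '\\' + name)

    return total_size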

The only significant speedup would come from using a different API, or a different technique. You mentioned in a comment doing this in the background, so you could structure it to do an incremental update using one of the packages for monitoring changes in folders. Possibly the FindFirstChangeNotification API or something like it. You could set up to monitor the entire tree, or depending on how that routine works (I haven't used it) you might be better off registering multiple requests on various subsets of the full tree, if that reduces the amount of searching you have to do (when notified) to figure out what actually changed and what size it is now.

Edit: I asked in a comment whether you were taking into account the heavy filesystem metadata caching that Windows XP and later do. I just checked performance of your code (and mine) against Windows itself, selecting all items in my C:\ folder and hitting Alt-Enter to bring up the properties window. After doing this once (using your code) and getting a 40s elapsed time, I now get 20s elapsed from both methods. In other words, your code is actually just as fast as Windows itself, at least on my machine.

Peter Hansen 2010-01-01 02:28:20
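The incremental idea could be sketched as follows: keep one size per directory, counting only the files directly inside it, and when a change notification names a directory, rescan just that directory. Everything here is an assumption for illustration. The class IncrementalSizes and its methods are hypothetical, os.scandir as a context manager needs Python 3.6 or later, and the notification source itself (for example FindFirstChangeNotification) is left out.

```python
import os


def direct_size(directory):
    """Size of the files directly inside directory (no recursion)."""
    total = 0
    try:
        with os.scandir(directory) as entries:
            for entry in entries:
                try:
                    if entry.is_file(follow_symlinks=False):
                        total += entry.stat(follow_symlinks=False).st_size
                except OSError:
                    continue  # entry vanished or access denied
    except OSError:
        pass  # directory vanished or access denied
    return total


class IncrementalSizes:
    """Hypothetical per-directory size table with cheap partial updates."""

    def __init__(self, root):
        self.root = root
        self.sizes = {}
        # One full scan up front; later updates touch single directories.
        for path, dirs, files in os.walk(root):
            self.sizes[path] = direct_size(path)

    def on_change(self, directory):
        """Call when a notification reports a change in directory."""
        if os.path.isdir(directory):
            self.sizes[directory] = direct_size(directory)
        else:
            # Directory was removed: drop it and everything below it.
            prefix = directory + os.sep
            for known in list(self.sizes):
                if known == directory or known.startswith(prefix):
                    del self.sizes[known]

    def total(self):
        """Current total size of the whole tree."""
        return sum(self.sizes.values())
```

The paths passed to on_change must be in the same form that os.walk produced, and a newly created subdirectory only shows up once on_change is called for that subdirectory itself. Whether this beats a full rescan depends on how precisely the notification API identifies what changed.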

Thanks for this enhanced version, even if the gain is small. It seems that we have reached a limit, so we are going to run our process in the background and keep the folder size list up to date using the change notification that Windows supports. Same idea as you are suggesting.

Laurent Luce 2010-01-01 04:23:42
