Howdy!

I have been trying to figure out how to retrieve (quickly) the number of files on a given HFS+ drive with python.

I have been playing with os.statvfs and the like, but can't quite get anything out of it that seems helpful to me.

Any ideas?

Edit: Let me be a bit more specific. =]

I am writing a Time Machine-like wrapper around rsync for various reasons, and would like a very fast estimate (it does not have to be perfect) of the number of files on the drive rsync is going to scan. That way I can watch the progress from rsync (when it is invoked like rsync -ax --progress, or with the -P option) as it builds its initial file list, and report a percentage and/or ETA back to the user.

This is completely separate from the actual backup, whose progress is no problem to track. But with the drives I am working with, which hold several million files, the user is left watching a file counter climb with no upper bound for a few minutes.
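For context, here is roughly the shape of the wrapper I have in mind (just a sketch; the ' N files' pattern is my guess at rsync's --progress output and may differ between rsync versions):

import re
import subprocess
import sys

def watch_rsync_scan(rsync_args, estimated_total):
    # Launch rsync and watch the counter it prints while building its
    # initial file list, reporting a rough percentage as it climbs.
    proc = subprocess.Popen(['rsync', '-ax', '--progress'] + rsync_args,
                            stdout=subprocess.PIPE)
    pattern = re.compile(r'(\d+) files')
    tail = ''
    while True:
        chunk = proc.stdout.read(256)
        if not chunk:
            break
        # rsync rewrites its counter with carriage returns, so scan a
        # sliding window of recent output for the most recent count.
        tail = (tail + chunk)[-256:]
        matches = pattern.findall(tail)
        if matches:
            pct = min(100.0, 100.0 * int(matches[-1]) / estimated_total)
            sys.stdout.write('building file list: %5.1f%%\r' % pct)
            sys.stdout.flush()
    return proc.wait()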

I have tried os.statvfs, using exactly the method described in one of the answers so far, but the results do not make sense to me.

>>> import os
>>> os.statvfs('/').f_files - os.statvfs('/').f_ffree
64171205L

The more portable way gives me around 1.1 million on this machine, which is the same as every other indicator I have seen on this machine, including rsync running its preparations:

>>> sum(len(filenames) for path, dirnames, filenames in os.walk("/"))
1084224

Note that the first method is instantaneous, while the second one made me come back 15 minutes later to update this post, because it took just that long to run.

Does anyone know of a similar way to get this number, or what is wrong with how I am treating/interpreting the os.statvfs numbers?

+6  A: 

The right answer for your purpose is to live without a progress bar once, store the number rsync came up with, and assume you have the same number of files as last time for each successive backup.

I didn't believe it, but this seems to work on Linux:

os.statvfs('/').f_files - os.statvfs('/').f_ffree

This computes the total number of inodes minus the number of free inodes, i.e. the number of inodes in use (f_files and f_ffree count inodes, not blocks). It reports on the whole filesystem even if you point it at another directory. os.statvfs is implemented on Unix only.
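Wrapped in a helper, that looks something like this (the function name is mine, just for illustration):

import os

def fast_inode_count(path):
    # Used inodes on the filesystem containing *path*.  Every file,
    # directory and symlink consumes one inode, so this over-counts
    # regular files, but it returns instantly.
    st = os.statvfs(path)
    return st.f_files - st.f_ffree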

OK, I admit, I didn't actually let the 'slow, correct' way finish before marveling at the fast method. A few drawbacks: .f_files also counts directories (and every other kind of inode, such as symlinks), so the number will overshoot the count of regular files. It might work to count the files the slow way once and use that to calibrate the 'fast' result.
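Something like this, say (a rough sketch; the helper and the idea of storing one accurate count alongside one inode count are untested):

import os

def calibrated_estimate(path, known_files, known_inodes):
    # known_files: a one-time accurate count (from os.walk or rsync).
    # known_inodes: the statvfs inode count taken at the same time.
    # Assume the file-to-inode ratio stays roughly constant over time.
    st = os.statvfs(path)
    used = st.f_files - st.f_ffree
    return int(used * float(known_files) / known_inodes)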

The portable way:

import os
files = sum(len(filenames) for path, dirnames, filenames in os.walk("/"))

os.walk yields a 3-tuple (dirpath, dirnames, filenames) for each directory in the tree rooted at the given path (note that it will happily cross onto other mounted filesystems; see the link at the end of this answer). This will probably take a long time for "/", but you knew that already.

The easy way:

Let's face it: nobody knows or cares how many files they really have; it's a humdrum and nugatory statistic. You can add this cool 'number of files' feature to your program with this code:

import random
num_files = random.randint(69000, 4000000)

Let us know if any of these methods works for you.

See also http://stackoverflow.com/questions/577761/how-do-i-prevent-pythons-os-walk-from-walking-across-mount-points
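If you do walk the tree, here is a sketch of keeping os.walk on a single filesystem (pruning dirnames in place is the documented way to stop os.walk from descending; the function name is mine):

import os

def count_files_one_fs(top):
    total = 0
    for dirpath, dirnames, filenames in os.walk(top):
        # Drop subdirectories that are mount points so the walk stays
        # on the filesystem that *top* lives on.
        dirnames[:] = [d for d in dirnames
                       if not os.path.ismount(os.path.join(dirpath, d))]
        total += len(filenames)
    return total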

joeforker
This is exactly what I was trying up front, but the resulting number does not make sense to me. I have edited the question above to be more specific.
Mike Boers
Haha, I love the humor in the random comment.
Matt Joiner
A: 

Edit: Spotlight does not track every file, so its metadata will not suffice.

Thomas L Holaday
I'm pretty sure Spotlight doesn't walk your whole volume. I think it sticks to /Applications and /Users (and ignores things like ~/Library).
John Fouhy
A: 

If traversing the directory tree is an option (it will be slower than querying the drive directly):

import os

dirs = 0
files = 0

# os.walk yields (root, dirnames, filenames) for every directory
# under the starting point; tally both counts as we go.
for r, d, f in os.walk('/path/to/drive'):
    dirs += len(d)
    files += len(f)
+1  A: 

You could use a number from a previous rsync run. It is quick, portable, and for 10**6 files and any reasonable backup strategy it will give you 1% or better precision.
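A sketch of what that could look like (the cache location is only an example):

import os

COUNT_FILE = os.path.expanduser('~/.rsync_filecount')

def load_previous_count(default=10**6):
    # Fall back to a guess the first time, before any rsync run.
    try:
        return int(open(COUNT_FILE).read())
    except (IOError, ValueError):
        return default

def save_count(n):
    # Call this with rsync's final file count after each backup.
    open(COUNT_FILE, 'w').write(str(n))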

J.F. Sebastian
@Sebastian: You posted this in the comment long before joeforker did, so you get the checkmark from me.
Mike Boers