tags:

views:

159

answers:

4

I have a directory with a large number of files (~1mil). I need to choose a random file from this directory. Since there are so many files, os.listdir naturally takes an eternity to finish.

Is there a way I can circumvent this problem? Maybe somehow get to know the number of files in the directory (without listing it) and choose the 'n'th file where n is randomly generated?

The files in the directory are randomly named.

+1  A: 

I'm not sure this is even possible. Even at the VFS or filesystem level, there is no guarantee that a directory entry count is even maintained. For instance many filesystems simply record the combined byte size of the directory entry structures contained in a given directory.

Estimation may be made if directory entries are fixed size structures, but this is uncommon now (consider LFN for FAT32). Even if a given filesystem did provide an entry count without needing to iterate through a directory, or if the VFS cached a record of a directories length, these would definitely be operating system, filesystem, and kernel specific.

Matt Joiner
Would it help if all the files in the directory are symbolic links? On my system, all these links are 512B in size. So could we possibly extract the number of files using this and the combined directory size information?
NoneType
I'm very hopeful that I'm wrong, I'm keen to see a nice technical answer to your question.
Matt Joiner
A: 

You may be able to get this running:

http://mail.python.org/pipermail/python-list/2009-July/1213182.html

And that's probably the best possible solution to your problem, but only where n is small - if n goes large then os.listdir is probably just as good for your purpose.

I've hunted around and failed to find any other way to open a file in a directory. If I had more time I'd be inclined to play around a bit and generate my own ~1mil files.


I just thought of another way to do this: Assuming the files are constant - you're not getting any more or less - you could keep a list of the filenames in a sqlite database. Then it would be relatively simple to query the database for a name by a random ROWID. I don't know if you'll still be plagued by the long time to search for the correct file, but at least getting a filename should take a short amount.

Of course if the files in the directory are randomly named, you can rename the files(?) and put them into a directory structure like AdamK suggests.

Wayne Werner
I shall try the `listdir` generator function along with the random sampling heuristic that Nas Banov suggested. (i.e., uniform sampling across all file names while reading them one by one)
NoneType
A: 

try this, (here is very fast with 50K files...)

import glob
import random

list = glob.glob("*/*.*")
print list[random.randrange(0,list.__len__())]
olarva
This takes and equally large amount of time.
NoneType
pls note `random.randrange(0,list.__len__())` is better written as `random.randrange(len(list))`
Nas Banov
+1  A: 

Alas, I don't think there is a solution to your problem. One, I don't know of portable API that will return you the number of entries in directory (w/o enumerating them first). Two, I don't think there is API to return you directory entry by number and not by name.

So overall, a program will have to enumerate O(n) directory entries to get a single random one. The trivial approach of determining number of entries and then picking one will either require enough RAM to hold the full listing (os.listdir()) or will have to enumerate 2nd time the directory to find the random(n) item - overall n+n/2 operations on average.

There is slightly better approach - but only slightly - see randomly-selecting-lines-from-files. In short there is a way to pick random item from list/iterator with unknown length, while reading one item at a time and ensure that any item may be picked with equal probability. But this won't help with os.listdir() because it already returns list in memory that already contains all 1M+ entries - so you can as well ask it about len() ...

Nas Banov
This is a good idea, I'm tempted to try this using the `os.listdir` generator function that Wayne suggested.
NoneType
@NoneType: If you'd like to toy with it, sure. But I don't think an improvement of only 2x is worth the effort; you should be shooting for something linear or logarithmic. For that though you should be able to change the problem somehow... why exactly do you need to do this random selection of file, what's the need behind it? Do you have better knowledge of the file naming schema?
Nas Banov