Hello,

I hit Linux's 32,000 subdirectory limit. It caused problems with my PHP scripts and I don't want it to happen again.

The simple solution is to have my PHP scripts check the current subdirectory count before trying to create another subdirectory.

All the ideas I've seen for performing such a check involve iterating over the entire directory and counting every folder. Since my concern is precisely with very large directories, is there a better way to retrieve the number of files/folders a directory contains?
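
For reference, the iterative check I'm describing looks roughly like this sketch (the function name, the path, and the limit are just placeholders):

<?php
// Count subdirectories by scanning the whole directory -- this is the
// cost I'd like to avoid on very large directories.
function count_subdirectories($dir)
{
    $count = 0;
    foreach (scandir($dir) as $entry) {
        if ($entry === '.' || $entry === '..') {
            continue;
        }
        if (is_dir($dir . '/' . $entry)) {
            $count++;
        }
    }
    return $count;
}

// Guard the mkdir with the count (32000 is the limit mentioned above;
// the exact usable number depends on the filesystem).
if (count_subdirectories('/var/data/users') < 32000) {
    mkdir('/var/data/users/newuser');
}
?>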

Bonus question: is there also a non-iterative way to find the disk usage of a directory?

Thanks in advance! Brian

+4  A: 

A better way is to design your directory layout so there's no way you'll ever have 32000 files in a single directory. In fact, I'd suggest that even 1000 files in a directory is too many.

The approach I usually take to this problem involves extra levels of directory hierarchy. A typical way is to take the file names you're currently storing in a single directory and break each one into pieces that correspond to nested directories. So, if you have a bunch of files like

xyzzy
foo
bar
blah

you might in fact store them as

x/xyzzy
f/foo
b/bar
b/blah

and so on. You can extend this to multiple directory levels, or use more than one character per level, to trade off the depth versus breadth of the layout.
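
In PHP, this kind of prefix splitting might look roughly like the following sketch (the base path, depth, and characters per level are just illustrative choices; short names would need padding or special-casing):

<?php
// Map a flat file name onto a nested path by using its leading characters
// as intermediate directory names. $levels and $chars_per_level control
// the depth/breadth trade-off.
function nested_path($base, $name, $levels = 1, $chars_per_level = 1)
{
    $parts = array();
    for ($i = 0; $i < $levels; $i++) {
        $parts[] = substr($name, $i * $chars_per_level, $chars_per_level);
    }
    return $base . '/' . implode('/', $parts) . '/' . $name;
}

echo nested_path('/data', 'xyzzy');         // /data/x/xyzzy
echo nested_path('/data', 'xyzzy', 2, 2);   // /data/xy/zz/xyzzy
?>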

You'll probably get suggestions to use a file system that doesn't have the 32k limit. Personally, even with such a file system, I would always use a scheme like the one I'm proposing here. It's almost impossible to work effectively with command-line tools in directories containing very large numbers of files (even ls becomes unwieldy), and this sort of manual exploration is always needed during development and debugging, and from time to time during normal operation.

Dale Hagglund
Whoops! I forgot to answer the bonus question: if by "non-iterative way to find the disk usage of a directory" you mean a way that doesn't involve looking at the size of every file in the directory, then the answer is no.
Dale Hagglund
Dale - thanks. My application requires that each user gets a directory, and I have more than 32000 users. But let's say, as an extreme example, that I had 1,000,000 users - how could my directory layout ever really avoid large directories in that case?
Brian
More levels of hierarchy give you exponential growth in the number of names you can store. For example, 1,000,000 is 1000 squared. Assuming alphabetic user names, 26^2 is 676, or roughly 1000, so you might consider file names like "us/er/user1", "us/er/user2", "xy/zz/xyzzy", "pl/ug/plugh", and so on.
Dale Hagglund
I should add that this simple scheme, as described, is susceptible to an uneven distribution of names. I.e., you've got a problem if every one of your million users chooses a name starting with "user". You can address this by hashing the user names and building the file names from the hash (note that git does this for its object storage). Eventually, it's possible you'll want to consider a proper database, although whether a key/value store like bdb or something SQL-based is the better fit would depend on how you want to manipulate and search your data.
Dale Hagglund
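
For example, the hashed variant Dale describes might look roughly like this in PHP (md5 and the two-character levels are illustrative choices, not part of his answer):

<?php
// Hash the user name first so the directory fan-out stays roughly even
// no matter how similar the raw names are (the same idea git uses for
// its object store).
function hashed_path($base, $username)
{
    $hash = md5($username);                    // 32 hex characters
    return $base . '/' . substr($hash, 0, 2)   // first level: 256 buckets
                 . '/' . substr($hash, 2, 2)   // second level: 256 buckets
                 . '/' . $username;
}

echo hashed_path('/data/users', 'user1');
?>
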
I suppose you have some unique numeric identifier for each user (e.g. an ID)? Then you can zero-pad it to a fixed width like XXXXXXXX (for example, the first user would be 00000001) and split it into groups of two digits: users/00/00/00/01/(user's stuff); users/01/12/32/41/(user's stuff). We use this scheme for storing tons of images and it has worked nicely so far.
bisko
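
A quick sketch of that numeric scheme (the base path and helper name are illustrative):

<?php
// Zero-pad the numeric user ID to 8 digits and split it into two-digit
// directory components, as bisko describes.
function id_path($base, $id)
{
    $padded = sprintf('%08d', $id);            // 1 -> "00000001"
    $parts  = str_split($padded, 2);           // ["00", "00", "00", "01"]
    return $base . '/' . implode('/', $parts);
}

echo id_path('/data/users', 1);        // /data/users/00/00/00/01
echo id_path('/data/users', 1123241);  // /data/users/01/12/32/41
?>
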
Thanks for the helpful answers.
Brian