views:

237

answers:

3

On our site, users can have many private files. We are trying to work out the best way to distribute them across directories without destroying the server's performance. The files are served through Apache and have to be listed each time a user manages them.

Our first approach right now is:

var first_level  = (int) ($user_id / 100);       // integer part: user 12345 -> 123
var files_folder = "/uf/$first_level/$user_id";  // e.g. /uf/123/12345

This gives us 100 first-level folders, each containing many second-level folders. Not every user has files, and we are currently at about 80k users, so this works out to roughly 800 second-level folders inside each first-level folder.

What do you think about this approach?

+1  A: 

If your user ID values are fairly uniformly distributed and the number will keep growing, you should probably balance the tree a bit more. What is best depends partly on how many users you expect to end up with. Big directories are slower to search than small ones; while 800 entries is not awful, it isn't great either.

If you want to stick with 2 tiers and you have N users (as your target population), aim for about sqrt(N) directories in the first tier, each containing about sqrt(N) second-tier (per-user) directories. For N = 80,000 that means roughly 300 folders per level (sqrt(80,000) ≈ 283). If you want to consider a 3-tier arrangement, replace the square root with the cube root.

You might also find that using modulo arithmetic gives you a smoother distribution. That is, the first level might be better calculated as:

var first_level = (int) ($user_id % 300);

Assuming your unidentified language uses % for its modulo operator.
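
A minimal sketch of that layout, assuming PHP (the $-prefixed variables in the question suggest it) and assuming 300 first-tier directories chosen from the ~80k user count:

// Hypothetical sketch: ~sqrt(N) first-tier directories, with one
// per-user directory underneath each of them.
$tiers        = 300;
$first_level  = $user_id % $tiers;               // 0 .. 299
$files_folder = "/uf/$first_level/$user_id";     // e.g. user 80123 -> /uf/23/80123

mkdir($files_folder, 0755, true);                // recursive create on first upload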

CPAN uses a system based on 3 tiers: the first tier is the first letter of the user's login ID, the second tier is the first two letters, and the third tier is the full login ID.
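
As a rough illustration of that CPAN-style layout (the login "jleffler" below is just a made-up example, and the /uf root is carried over from the question):

$login = strtoupper($user_login);                 // e.g. "jleffler" -> "JLEFFLER"
$path  = "/uf/" . $login[0] . "/" . substr($login, 0, 2) . "/" . $login;
// -> /uf/J/JL/JLEFFLER, the same shape as CPAN's authors/id tree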

I read somewhere that some site (university-based, IIRC) found that using the first and last letters of the name gave a good distribution.

Jonathan Leffler
+1  A: 

A popular, scalable folder-naming scheme, if you don't care about readability, is something like the one squid uses: <4-bit>/<8-bit>/<remaining-116-bit-of-md5-of-whatever-lookup-key>, where the last component can instead be <whatever-unique-key-you-have>. With that split, user id 1 (md5 c4ca4238...) ends up at /c/4c/1.

In this case, the first level holds up to 16 directories and each first-level directory holds up to 256 second-level directories.

The big advantage of this approach is that the distribution of folders is statistically uniform, regardless of whether you have holes or clusters in your user ids/usernames (smaller user ids tend to go unused due to attrition).
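
A sketch of that hashed layout, assuming PHP and the 4-bit/8-bit split described above (the split sizes are easy to vary):

// Hash the key so folders spread uniformly even if user ids are clustered.
$hash   = md5((string) $user_id);             // 32 hex characters
$first  = substr($hash, 0, 1);                // 4 bits  -> up to 16 first-level dirs
$second = substr($hash, 1, 2);                // next 8 bits -> up to 256 second-level dirs
$files_folder = "/uf/$first/$second/$user_id";
// e.g. user id 1 (md5 c4ca4238...) -> /uf/c/4c/1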

obecalp
+1  A: 

You don't say what filesystem is used to store the files. It should be easy for you to create a random directory tree with the characteristics you expect of your real load. Then you can run experiments that will tell you the performance of the various strategies you are considering.
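
For example, a throwaway script along these lines (PHP, with made-up tree sizes and a /tmp path) can populate a synthetic tree and time lookups against it; the numbers it prints are only meaningful on the same filesystem you plan to use in production:

// Hypothetical benchmark sketch: build a 300 x 300 directory tree with one
// empty file per leaf, then time random lookups against it.
$root  = "/tmp/uf_bench";
$tiers = 300;

for ($i = 0; $i < $tiers; $i++) {
    for ($j = 0; $j < $tiers; $j++) {
        $dir = "$root/$i/$j";
        mkdir($dir, 0755, true);
        touch("$dir/file.dat");
    }
}

$lookups = 100000;
$start   = microtime(true);
for ($k = 0; $k < $lookups; $k++) {
    $i = rand(0, $tiers - 1);
    $j = rand(0, $tiers - 1);
    clearstatcache();                            // avoid PHP's stat cache skewing the timing
    file_exists("$root/$i/$j/file.dat");         // forces a directory lookup
}
printf("%.0f lookups/sec\n", $lookups / (microtime(true) - $start));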

I couldn't easily find information about which filesystems use efficient data structures like B-trees for large directories. I did find a claim that the MacOS HFS does. I would look into XFS or another high-performance, journaling filesystem.

Norman Ramsey