views:

237

answers:

3

On our site, users can have many private files. We are trying to work out the best way to distribute them across directories without destroying the server's performance. The files are served through Apache and have to be listed each time a user manages them.

Our first approach right now is:

var first_level  = (int) ($user_id / 100);       // integer part: user 12345 -> 123
var files_folder = "/uf/$first_level/$user_id";  // e.g. /uf/123/12345

This gives us 100 first-level folders, each containing many second-level folders. Not every user has files, and we are currently at about 80k users, so this works out to roughly 800 second-level folders inside each first-level folder.

What do you think about this approach?

+1  A: 

If your user ID values are fairly uniformly distributed and the number will keep growing, you should probably balance the tree a bit more. What is best depends partly on how many users you expect to end up with. Big directories are slower to search than small ones; while 800 entries is not awful, it isn't great either.

If you want to stick with 2 tiers and you have N users (as your target population), aim for about sqrt(N) directories in the first tier, each containing about sqrt(N) second-tier (per-user) directories. For N = 80,000 that means roughly 300 folders per level (sqrt(80,000) ≈ 283). If you want to consider a 3-tier arrangement, replace the square root with the cube root.

You might also find that using modulo arithmetic gives you a smoother distribution. That is, the first level might be better calculated as:

var first_level = (int) ($user_id % 300);

Assuming your unidentified language uses % for its modulo operator.
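
A minimal sketch of that layout, assuming PHP (the $-prefixed variables in the question suggest it) and assuming 300 first-tier directories chosen from the ~80k user count:

// Hypothetical sketch: ~sqrt(N) first-tier directories, with one
// per-user directory underneath each of them.
$tiers        = 300;
$first_level  = $user_id % $tiers;               // 0 .. 299
$files_folder = "/uf/$first_level/$user_id";     // e.g. user 80123 -> /uf/23/80123

mkdir($files_folder, 0755, true);                // recursive create on first upload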

CPAN uses a system based on 3 tiers: the first tier is the first letter of the user's login ID, the second tier is the first two letters, and the third tier is the full login ID.
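
As a rough illustration of that CPAN-style layout (the login "jleffler" below is just a made-up example, and the /uf root is carried over from the question):

$login = strtoupper($user_login);                 // e.g. "jleffler" -> "JLEFFLER"
$path  = "/uf/" . $login[0] . "/" . substr($login, 0, 2) . "/" . $login;
// -> /uf/J/JL/JLEFFLER, the same shape as CPAN's authors/id tree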

I read somewhere that some site (university-based, IIRC) found that using the first and last letters of the name gave a good distribution.

Jonathan Leffler
+1  A: 

A popular, scalable folder-naming scheme, if you don't care about readability, is something like the one squid uses: <4-bit>/<8-bit>/<remaining-116-bit-of-md5-of-whatever-lookup-key>, where the last component can instead be <whatever-unique-key-you-have>. With that split, user id 1 (md5 c4ca4238...) ends up at /c/4c/1.

In this case, the first level holds up to 16 directories and each first-level directory holds up to 256 second-level directories.

The big advantage of this approach is that the distribution of folders is statistically uniform, regardless of whether you have holes or clusters in your user ids/usernames (smaller user ids tend to go unused due to attrition).
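
A sketch of that hashed layout, assuming PHP and the 4-bit/8-bit split described above (the split sizes are easy to vary):

// Hash the key so folders spread uniformly even if user ids are clustered.
$hash   = md5((string) $user_id);             // 32 hex characters
$first  = substr($hash, 0, 1);                // 4 bits  -> up to 16 first-level dirs
$second = substr($hash, 1, 2);                // next 8 bits -> up to 256 second-level dirs
$files_folder = "/uf/$first/$second/$user_id";
// e.g. user id 1 (md5 c4ca4238...) -> /uf/c/4c/1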

obecalp
+1  A: 

You don't say what filesystem is used to store the files. It should be easy for you to create a random directory tree with the characteristics you expect of your real load. Then you can run experiments that will tell you the performance of the various strategies you are considering.
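
For example, a throwaway script along these lines (PHP, with made-up tree sizes and a /tmp path) can populate a synthetic tree and time lookups against it; the numbers it prints are only meaningful on the same filesystem you plan to use in production:

// Hypothetical benchmark sketch: build a 300 x 300 directory tree with one
// empty file per leaf, then time random lookups against it.
$root  = "/tmp/uf_bench";
$tiers = 300;

for ($i = 0; $i < $tiers; $i++) {
    for ($j = 0; $j < $tiers; $j++) {
        $dir = "$root/$i/$j";
        mkdir($dir, 0755, true);
        touch("$dir/file.dat");
    }
}

$lookups = 100000;
$start   = microtime(true);
for ($k = 0; $k < $lookups; $k++) {
    $i = rand(0, $tiers - 1);
    $j = rand(0, $tiers - 1);
    clearstatcache();                            // avoid PHP's stat cache skewing the timing
    file_exists("$root/$i/$j/file.dat");         // forces a directory lookup
}
printf("%.0f lookups/sec\n", $lookups / (microtime(true) - $start));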

I couldn't easily find information about which filesystems use efficient data structures like B-trees for large directories. I did find a claim that the MacOS HFS does. I would look into XFS or another high-performance, journaling filesystem.

Norman Ramsey