I've got a simple file host going that gives files a unique id and just stores them in a directory. I've been told that this will cause problems in the future, and I'm wondering what I should look out for to make sure it keeps working smoothly as it grows.

Also, is there a performance issue with forcing downloads by sending header information and readfile()? Would it be better to preserve file names and allow users to download directly instead of going through a script?

Thanks

A: 

In my opinion, I suggest using a script so you can control abuse. I also suggest preserving the original file names, unless your script keeps an index in a database that maps each unique id back to the original name. You could also add some Rewrite magic on top, which brings another layer of security by not exposing the real name behind your unique id to the end user.

Codex73
+5  A: 

The kind of problems you have been told about very likely have to do with the performance impact of piling thousands and thousands of files in the same directory.

To circumvent this, do not store your files directly under one directory, but try to spread them out under subdirectories (buckets).

In order to achieve this, look at the ID (let's say 19873) of the file you are about to store, and store it under <uploads>/73/98/19873_<filename.ext>, where 73 is ID % 100, 98 is (ID / 100) % 100 etc.

The above guarantees that you will have at most 100 subdirectories under <uploads>, and at most 100 further subdirectories underneath <uploads>/*. This will thin out the number of files per directory at the leaves significantly.

Two levels of subdirectories are typical, and strike a good balance between breadth (having too many filenames to look through in a single directory when resolving names to inodes - although modern filesystems such as ext3 are quite efficient here) and depth (having to descend 20 subdirectories to find your file). You may also elect to use a larger or smaller modulus (10, 1000) instead of 100; two levels with modulo 100 is a good fit for roughly 100k to 5M files.

Employ the same technique to calculate the full path of a file on the filesystem given the ID of a file that needs to be retrieved.
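For illustration, here is a minimal PHP sketch of that path calculation; the base directory and helper name are placeholders, not part of the original answer:

    <?php
    // Hypothetical helper: map a numeric file ID and original name to its bucketed path.
    function bucketed_path($id, $filename, $base = '/var/uploads') {
        $level1 = $id % 100;               // e.g. 19873 -> 73
        $level2 = intdiv($id, 100) % 100;  // e.g. 19873 -> 98 (use (int)($id / 100) on PHP < 7)
        return sprintf('%s/%02d/%02d/%d_%s', $base, $level1, $level2, $id, $filename);
    }

    echo bucketed_path(19873, 'report.pdf');
    // /var/uploads/73/98/19873_report.pdf

The same helper is used both when storing a new upload and when turning an ID back into a path for retrieval.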

Cheers, V.

vladr
+3  A: 

Your first question really depends on the type of file system you are using. I'll assume ext3 without any journaling optimizations when answering.

First, yes, many files in one place can cause problems once shell wildcard expansion exceeds the system's ARG_MAX. In other words, rm -rf * would quit while complaining about too many arguments. You might consider having directories A-Z / a-z and parking the files appropriately based on the value of the left-most byte of each unique name.
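A quick PHP sketch of that kind of first-character bucketing (the directory layout and file names here are illustrative only):

    <?php
    // Hypothetical: park a file in a bucket named after the left-most byte of its unique name.
    $uniqueName = 'aF3kQ9_report.pdf';               // generated unique file name
    $bucketDir  = '/var/uploads/' . $uniqueName[0];  // e.g. /var/uploads/a
    if (!is_dir($bucketDir)) {
        mkdir($bucketDir, 0755, true);               // create the bucket on first use
    }
    rename('/tmp/incoming_upload', $bucketDir . '/' . $uniqueName);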

Also, try to avoid processes that will open all of those files in a short period of time... crons like 'updatedb' will cause problems once you really start filling up. Likewise, try to keep those directories out of the scope of commands like 'find'.

That leads to the other potential issue: caching. How frequently are these files accessed? If there were 300 files in a given directory, would all of them be accessed at least once per 30 minutes? If so, you'll likely want to turn up the /proc/sys/vm/vfs_cache_pressure setting so that Linux reclaims the dentry/inode cache more aggressively and leaves more memory available to PHP/Apache/etc.

Finally, regarding readfile ... I would suggest just using a direct download link. This avoids PHP having to stay alive during the course of the download.

Tim Post
+1  A: 

If you're likely to have thousands of files, you should spread them among many subdirectories.

I suggest keeping the original filename, though you might need to mangle it to guarantee uniqueness. This helps when you are diagnosing problems.

Robert Lewis
+3  A: 

Also, is there a performance issue with forcing downloads by sending header information and readfile()?

Yes, if you do it naively. A good file download script should:

  • stream long files to avoid filling memory
  • support ETags and Last-Modified request/response headers to ensure caches continue to work
  • come up with reasonable Expires/Cache-Control settings

It still won't be as fast as the web server (which is typically written in C and heavily optimised for serving files, maybe even using OS kernel features for it), but it'll be much better.
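A minimal PHP sketch of a download script along those lines; the path, filename, and cache lifetime are placeholders, not prescriptions:

    <?php
    // Resolve the requested ID to a path (placeholder value here).
    $path = '/var/uploads/73/98/19873_report.pdf';

    $mtime = filemtime($path);
    $etag  = '"' . md5($path . $mtime . filesize($path)) . '"';

    // Conditional requests: let caches revalidate instead of re-downloading.
    $ifNoneMatch     = $_SERVER['HTTP_IF_NONE_MATCH']     ?? '';
    $ifModifiedSince = $_SERVER['HTTP_IF_MODIFIED_SINCE'] ?? '';
    if ($ifNoneMatch === $etag ||
        ($ifModifiedSince !== '' && strtotime($ifModifiedSince) >= $mtime)) {
        header('HTTP/1.1 304 Not Modified');
        exit;
    }

    header('Content-Type: application/octet-stream');
    header('Content-Length: ' . filesize($path));
    header('Content-Disposition: attachment; filename="report.pdf"');
    header('ETag: ' . $etag);
    header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $mtime) . ' GMT');
    header('Cache-Control: public, max-age=86400');
    header('Expires: ' . gmdate('D, d M Y H:i:s', time() + 86400) . ' GMT');

    // Stream in chunks instead of reading the whole file into memory.
    $fp = fopen($path, 'rb');
    while (!feof($fp)) {
        echo fread($fp, 8192);
        flush();
    }
    fclose($fp);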

Would it be better to preserve file names and allow users to download directly instead of using a script?

It would perform better, yes, but getting the security right is a challenge. See here for some discussion.

A compromise is to use a rewrite, so that the URL looks something like:

http://www.example.com/files/1234/Lovely_long_filename_that_can_contain_any_Unicode_character.zip

But it gets redirected internally to:

http://www.example.com/realfiles/1234.dat

and served (quickly) by the web server.
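One way to wire that up, assuming Apache with mod_rewrite (rule written for an .htaccess file in the document root; adjust to taste):

    RewriteEngine On
    # Internally map the friendly URL to the stored file; the long filename is ignored.
    RewriteRule ^files/(\d+)/.+$ /realfiles/$1.dat [L]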

bobince