I am getting thousands of pictures uploaded by thousands of users on my Linux server, which is hosted by 1and1.com (I believe they use CentOS, but am unsure of the version). This is a language agnostic question, however, for your reference, I am using PHP.

My first thought was to just dump them all in the same directory. However, I remember that there used to be a limit on how many files or directories a single directory could hold.

My second thought was to partition the files into directories based on each user's email address (which I am using as the user name anyhow), but I don't want to run into the limit on directories within a directory either.

Anyhow, for images from [email protected], I was going to do this:

/images/domain.com/user/images...

Is this smart? What if thousands of users are on, say, 'gmail'? Perhaps I could go even deeper, like this:

/images/domain.com/[first letter of user name]/user/images...

so for [email protected] it would be...

/images/domain.com/m/mike/images...

Is this a bad approach? What is everyone else doing? I don't want to run into problems with too many directories also...
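
A minimal sketch of the layout described above, assuming the address is validated first. The function name email_image_dir() and the character whitelist are assumptions made for illustration; they are not part of the original setup.

```php
<?php
// Hypothetical helper: maps an email address to the proposed directory layout,
// e.g. /images/domain.com/m/mike/images
function email_image_dir(string $email): string
{
    if (filter_var($email, FILTER_VALIDATE_EMAIL) === false) {
        throw new InvalidArgumentException('Not a valid email address');
    }

    $email  = strtolower($email);
    $at     = strrpos($email, '@');          // split at the last '@'
    $user   = substr($email, 0, $at);
    $domain = substr($email, $at + 1);

    // Assumption: replace anything outside a small whitelist so the
    // address cannot introduce extra path separators.
    $user   = preg_replace('/[^a-z0-9_-]/', '_', $user);
    $domain = preg_replace('/[^a-z0-9.-]/', '_', $domain);

    return '/images/' . $domain . '/' . $user[0] . '/' . $user . '/images';
}
```

Note that a popular domain such as gmail.com still ends up with at most a few dozen first-letter buckets, so each bucket can hold many user directories.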



+2  A: 

What I used for another requirement but which can fit your needs is to use a simple convention.

Increment the last id by 1, take the length (number of digits) of the new number, and prefix the new number with that length.

For example:

Assume 'a' is a variable holding the last id.

a = 564;
++a;                 // 565
prefix = length(a);  // 3
id = prefix + a;     // string concatenation: "3565"

Then you can use a date stamp for the directory, using this convention:

20090523 (yyyymmdd)

Then you can explode your path like this:

2009/05/23/3565.jpg

(or more)

It's interesting because you can keep a sort order by date and by number at the same time (sometimes useful). And you can still decompose your path into more directories.
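
A rough PHP sketch of this convention, assuming the last id comes from somewhere like a database counter. The function name dated_image_path() and its parameters are made up for illustration.

```php
<?php
// Hypothetical helper: builds a relative path such as "2009/05/23/3565.jpg".
function dated_image_path(int $lastId, DateTimeInterface $when, string $ext = 'jpg'): string
{
    $next = $lastId + 1;                      // increment the last id
    $id   = strlen((string) $next) . $next;   // prefix with the digit count: 565 -> "3565"

    // One directory level each for year, month and day.
    return $when->format('Y/m/d') . '/' . $id . '.' . $ext;
}

echo dated_image_path(564, new DateTimeImmutable('2009-05-23')), "\n"; // 2009/05/23/3565.jpg
```

With ids of up to nine digits the length prefix is a single character, so sorting the file names as strings gives the same order as sorting the ids as numbers.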

Boris Guéry
+8  A: 

I would do the following:

  1. Take an MD5 hash of each image as it comes in.
  2. Write that MD5 hash in the database where you are keeping track of these things.
  3. Store them in a directory structure where the first couple of characters of the MD5 hex string become the directory names. So if the hash is 'abcdef1234567890', you would store it as 'a/b/abcdef1234567890'.

Using a hash also lets you merge the same image uploaded multiple times.
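
The steps above could look roughly like this in PHP. This is only a sketch: hashed_image_path() is a hypothetical name, and the database write from step 2 is left out.

```php
<?php
// Hypothetical helper: relative storage path derived from the image contents.
function hashed_image_path(string $contents): string
{
    $hash = md5($contents);   // 32 lowercase hex characters

    // The first two hex characters become two directory levels,
    // giving 16 * 16 = 256 buckets in total.
    return $hash[0] . '/' . $hash[1] . '/' . $hash;
}

echo hashed_image_path('example image bytes'), "\n";
```

For a file already on disk, PHP's md5_file() computes the same hash without loading the contents into a string first.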

Joe Beda
+2  A: 
Jason S
all the images will be below the root of the web folder, so that they can not access them without using our function to retrieve them.
Mike Curry
still, if you couple the structure they access them from to the structure you store them in, then you can't change the storage layout without changing the URLs. If you decouple them, you can change the storage structure later if necessary.
Jason S
A: 

Here are two functions I wrote a while back for exactly this situation. They've been in use for over a year on a site with thousands of members, each of which has lots of files.

In essence, the idea is to use the last digits of each member's unique database ID to calculate a directory structure, with a unique directory for everyone. Using the last digits, rather than the first, ensures a more even spread of directories. A separate directory for each member means maintenance tasks are a lot simpler, plus you can see where people's stuff is (as in visually).

// checks for member directories & creates them if required
function member_dirs($user_id) {

    $user_id = sanitize_var($user_id);

    // a negative start makes substr() count from the end of the string
    $dir_1 = substr($user_id, -1); // last digit
    $dir_2 = substr($user_id, -2); // last two digits
    $dir_3 = substr($user_id, -3); // last three digits

    // e.g. ID 12345 gives files/members/5/45/345/12345/
    $user_dir = array();
    $user_dir[0] = $GLOBALS['site_path'] . "files/members/" . $dir_1 . "/";
    $user_dir[1] = $user_dir[0] . $dir_2 . "/";
    $user_dir[2] = $user_dir[1] . $dir_3 . "/";
    $user_dir[3] = $user_dir[2] . $user_id . "/";
    $user_dir[4] = $user_dir[3] . "sml/";
    $user_dir[5] = $user_dir[3] . "lrg/";

    foreach ($user_dir as $this_dir) {
        if (!is_dir($this_dir)) { // directory doesn't exist
            if (!mkdir($this_dir, 0777)) { // attempt to make it with read, write, execute permissions
                return false; // bug out if it can't be created
            }
        }
    }

    // if we've got to here all directories exist or have been created so all good
    return true;

}

// accompanying function to above: returns the member's path relative to the site root
function make_path_from_id($user_id) {

    $user_id = sanitize_var($user_id);

    // a negative start makes substr() count from the end of the string
    $dir_1 = substr($user_id, -1); // last digit
    $dir_2 = substr($user_id, -2); // last two digits
    $dir_3 = substr($user_id, -3); // last three digits

    $user_path = "files/members/" . $dir_1 . "/" . $dir_2 . "/" . $dir_3 . "/" . $user_id . "/";
    return $user_path;

}

sanitize_var() is a supporting function for scrubbing input and ensuring it's numeric; $GLOBALS['site_path'] is the absolute path for the server. Hopefully the rest is self-explanatory.
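
Since sanitize_var() isn't shown, here is a hypothetical stand-in (an assumption for illustration, not the original function) together with a quick look at how the last digits spread a few sample IDs across directories:

```php
<?php
// Hypothetical stand-in for sanitize_var(): strips everything except digits.
function sanitize_var($value): string
{
    return preg_replace('/\D/', '', (string) $value);
}

// Print the relative member path for a few sample IDs.
foreach (array(12345, 12346, 7) as $id) {
    $id = sanitize_var($id);
    echo $id, ' -> files/members/',
        substr($id, -1), '/', substr($id, -2), '/', substr($id, -3), '/', $id, "/\n";
}
// 12345 -> files/members/5/45/345/12345/
// 12346 -> files/members/6/46/346/12346/
// 7 -> files/members/7/7/7/7/
```

Consecutive IDs land in different top-level directories, which is the even spread mentioned above.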

da5id
A: 

Joe Beda's answer is almost perfect, but please note that MD5 collisions have been shown to be producible in, IIRC, about 2 hours on a laptop.

That said, if you actually use the file's MD5 hash in the described way, your service will become vulnerable to attacks. What would the attack look like?

  1. A hacker doesn't like a particular photo
  2. He makes sure that it is plain MD5 you are using (MD5 of image+secret_string could scare him off)
  3. He uses a magic method to craft a picture of (use your imagination here) whose hash collides with that of the photo he doesn't like
  4. He uploads the photo like he would normally do
  5. Your service overwrites the old photo with the new one, so both uploads now display the new picture

Someone says: let's not overwrite it then. Then, if it's possible to predict that someone will upload something (e.g. a popular picture on the web might get uploaded), it's possible to take its "hash-place" first. A user would happily upload a picture of a kitty, only to find that it actually appears as (use your imagination here). I say: use SHA1, as breaking it has been estimated to take, IIRC, 127 years on a cluster of 10,000 computers.
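
A sketch of the two mitigations mentioned above: hashing the image together with a secret string, and refusing to overwrite an existing file. Note the swap: this uses HMAC-SHA1 via hash_hmac() instead of hashing a plain image+secret concatenation. Both function names are hypothetical.

```php
<?php
// Hypothetical helper: keyed hash of the image contents (HMAC-SHA1, 40 hex characters).
// Without the secret, an outsider cannot compute where a given picture will be stored.
function keyed_image_hash(string $contents, string $secret): string
{
    return hash_hmac('sha1', $contents, $secret);
}

// Hypothetical helper: writes the file only if nothing exists at $path yet.
function store_without_overwrite(string $path, string $contents): bool
{
    $handle = @fopen($path, 'x');   // mode 'x' fails if the file already exists
    if ($handle === false) {
        return false;
    }
    $ok = fwrite($handle, $contents) !== false;
    fclose($handle);
    return $ok;
}
```

The secret has to stay the same for the lifetime of the stored files, otherwise existing images can no longer be located by their hash.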

Reef
you're talking about a preimage attack, which hasn't been successful yet against MD5, only collision attacks http://www.vpnc.org/hash.html
Jason S
http://en.wikipedia.org/wiki/MD5 : "On 1 March 2005, Arjen Lenstra, Xiaoyun Wang, and Benne de Weger demonstrated construction of two X.509 certificates with different public keys and the same MD5 hash, a demonstrably practical collision." (...)
Reef
I use 'salt' on all my food... :D
Mike Curry