If I have a site where users can upload as many images as they want (think Photobucket), what is the best way to set up file storage? (Also, all uploads get a unique random timestamp.)

site root
--username
----image1.jpg
----image2.jpg
----image3.jpg
--anotheruser
----image1.jpg
----image2.jpg
----image3.jpg
...

or

siteroot
--uploads
----image1.jpg
----image2.jpg
----image3.jpg
----image4.jpg
----image5.jpg
...
----image50000.jpg

I think the first method is more organized, but I believe the second method (keeping all uploads in the same directory) is the standard one. I wonder, though, whether retrieving an image would be slower if there are thousands of images in the same directory.

--- edit ---

Thanks for the great answers so far. Also, I will be creating thumbnails, so I would have to insert that directory somewhere... or create a naming convention such as thumb_whatever.jpg.

There are so many different ways to do this. Yes, disk space will be a problem, but for now I am concerned with retrieval time. When I have to output an image to the browser, if that image sits in a directory with 10,000 other images, I am worried about how slow that could get.

+1  A: 

I think that subdirectories under the uploads directory would be the best.

site root
--uploads
----username
------image1.jpg
------image2.jpg
------image3.jpg
----anotheruser
------image1.jpg
------image2.jpg
------image3.jpg
...

Depending on the host OS, having too many files in one directory could cause some headaches and compatibility problems. Also, depending on how you are getting the image list, it could cause performance issues.

Plus, option 2 would be a mess. :)
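A minimal sketch of that per-user layout, assuming Python on the server (UPLOAD_ROOT and the sanitizing rule are placeholders, not anything prescribed in the answer):

import os
import re
import time

UPLOAD_ROOT = "/var/www/site/uploads"  # placeholder site root

def user_upload_path(username, original_name):
    """Build uploads/<username>/<timestamped name>, creating the directory if needed."""
    # Whitelist characters so a username can't escape the uploads tree.
    safe_user = re.sub(r"[^a-zA-Z0-9_-]", "_", username)
    user_dir = os.path.join(UPLOAD_ROOT, safe_user)
    os.makedirs(user_dir, exist_ok=True)
    ext = os.path.splitext(original_name)[1].lower()
    # The question says every upload gets a unique timestamp-based name.
    return os.path.join(user_dir, "upload-%d%s" % (int(time.time() * 1000), ext))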

Buggabill
+4  A: 

The answer to that is "maybe". File retrieval itself may be fine, but if you ever need to do any maintenance on the folder, it would be a huge headache as processes attempt to enumerate the directory listing.

What would improve the situation is a number of subdirectories under the images folder (or two levels, depending on how many images you're looking at storing), giving you a hierarchy like this:

siteroot
-- uploads
---- a
---- b
---- c
  :
---- z

...and then store files based on their first letter (so all images whose names start with 'a' go into the folder 'a'). You could extend this to a two- or three-letter prefix (aa, ab, ac, ad, ..., ba, bb, bc, ..., zx, zy, zz) and possibly have a hierarchy under that as well, splitting files across folders based on the first four characters of the name.

If files are then assigned a random alpha-numeric name then this would ensure files are spread evenly across all the folders (given a large enough sample size).
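As a rough sketch of that prefix bucketing (hypothetical names; assumes filenames are random alphanumerics, as above):

import os

def shard_path(upload_root, filename, depth=2):
    """Place a file under nested folders named after its leading characters.

    With depth=2, 'abcd1234.jpg' lands in <root>/a/b/abcd1234.jpg, so random
    names spread roughly evenly across the buckets at each level.
    """
    parts = list(filename[:depth])
    directory = os.path.join(upload_root, *parts)
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, filename)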

You might want to consider a mix of your option (1) and splitting images over a hierarchy as I've described above. That would ensure that if a single user does upload lots of files, then you're covered. Similarly, if you're looking at a lot of user directories, the same principle applies to ensure you don't have 1,000,000 user directories under a single parent.

Chris J
All nice... until you run out of disk space.
Toad
@reinier -- you'll have disk-space issues no matter what strategy you use. At the end of the day, it's up to the software to handle a failure correctly. If you're thinking of inode counts, two levels of folders is 676 nodes (assuming a-z only). The OP is concerned with tens of thousands of files; adding a few directories isn't going to change that.
Chris J
@Chris: well, not if you use a DB, where adding extra space is as easy as editing an ini file. With folder schemes like the one you suggest, adding extra physical hard disks would mean changing the naming scheme, and thus writing a script that moves all files and folders to the new scheme, potentially running for days.
Toad
@reinier -- if I ever saw code that couldn't handle an out-of-disk-space condition, I'd worry, regardless of the underlying storage medium (i.e., filesystem or database). Even with your solution, you could still run out of disk space. Sure, adding more space may be easier, but that doesn't help if nobody is monitoring free space.
Chris J
+2  A: 

Try MongoDB... it is a document database that can also store binary data (via GridFS). It's very fast and efficient, and it supports sharding (spreading data over multiple machines) out of the box.

You really don't want folders upon folders full of files. Managing those folders takes forever, and changing the naming/partitioning scheme later is a nightmare. Furthermore, if you run out of disk space, you have a problem. And for load balancing, having one hard disk stuffed full of files is not efficient either.
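For illustration, binary storage in MongoDB usually goes through GridFS; a minimal sketch with the pymongo driver (connection string, database name, and metadata fields are placeholders):

import gridfs
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["photos"]  # placeholder deployment
fs = gridfs.GridFS(db)

# Store an upload; GridFS chunks the bytes across documents for you.
with open("image1.jpg", "rb") as f:
    file_id = fs.put(f, filename="image1.jpg", username="someuser")

# Retrieve it later by id (or query by filename/metadata).
data = fs.get(file_id).read()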

Toad
+2  A: 

It depends on the file system. For example, FAT16 tends to be quite slow once you have more than 512 files in a directory. FAT32 and NTFS don't share that limitation but still slow down considerably with an extremely large number of files. Even on one of the more robust Linux file systems, you'll be able to scan directories more quickly when they're smaller.

I would definitely go with #1 - splitting the images into directories by user.

Adisak
+2  A: 

The number of files in a directory should have no effect at all on the time required to read a file's data - but it can massively affect the amount of time needed to find the file before you can start to read it.

The exact breakpoints where the major issues start will vary from filesystem type to filesystem type, but, in general, if you're talking about a few hundred files, you don't need to worry much. If you're talking about a few thousand, it's worth thinking about and maybe doing a little benchmarking to see how your filesystem and hardware handle it. If you're talking about tens of thousands of files, then you really need to start breaking things up. (I once had a Linux/e2fs print server where CUPS wasn't deleting its job-control files after it finished printing and it got up to around 100,000 files in one directory. Just getting a directory listing took over half an hour before it even started to display any filenames.)
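If you want to see where your own setup falls, a quick benchmarking sketch (the paths are illustrative, not real):

import os
import time

def time_listing(path):
    """Time how long it takes just to enumerate a directory's entries."""
    start = time.perf_counter()
    count = sum(1 for _ in os.scandir(path))
    elapsed = time.perf_counter() - start
    print("%s: %d entries in %.3fs" % (path, count, elapsed))
    return elapsed

# Compare a flat layout against a sharded one, e.g.:
# time_listing("/var/www/uploads")      # tens of thousands of files
# time_listing("/var/www/uploads/a/b")  # a few hundred files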

Separating them by user name may not be the best choice, though, since you'll likely have a lot of users uploading very few images and perhaps a couple who upload hundreds or thousands of images, potentially creating access time issues in those users' storage directories. The bigger problem in that scenario is that you'd likely end up (assuming a successful site) with thousands or tens of thousands of users and a large number of subdirectories is just as bad as a large number of files for slowing down access to your data.

Since you're going to have a timestamp on them, what I would probably do is put them into subdirectories based on the last three digits of the timestamp. That will distribute the files relatively evenly across 1000 subdirectories and should keep the number of files in each directory reasonably small. (Using the first three digits would cause one directory to be filled before moving to the next instead of distributing them evenly.) If you're still ending up with too many files in each subdirectory (which would likely mean you're dealing with several million uploaded images), you could add a second level for the previous three digits, so upload-1234567890.jpg would end up at /567/890/upload-1234567890.jpg.
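A sketch of that last-digits bucketing (assuming a ten-digit integer timestamp embedded in the name, as in the answer's example):

import os

def timestamp_path(upload_root, timestamp, ext=".jpg"):
    """Bucket by the last three digits, with the three before as a second level.

    timestamp_path("/uploads", 1234567890) -> /uploads/567/890/upload-1234567890.jpg
    """
    name = "upload-%d" % timestamp
    last3 = name[-3:]    # '890' -- fills all 1000 buckets evenly as uploads accrue
    prev3 = name[-6:-3]  # '567' -- optional second level for millions of images
    directory = os.path.join(upload_root, prev3, last3)
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, name + ext)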

Dave Sherohman
