views: 63

answers: 3
Hi, I'm developing a website which might grow to a few thousand users, each of whom could upload up to ten pictures to the server. I'm wondering what would be the best way of storing the pictures. Let's assume I have 5,000 users with 10 pictures each, which gives us 50,000 pics. (I guess it wouldn't be a good idea to store them in the database as blobs ;) )

Would it be a good idea to dynamically create a directory for every 100 registered users (50 dirs in total, assuming 5,000 users) and upload their pictures there? Would the naming convention 'xxx_yy.jpg' (xxx being the user id and yy the picture number) be OK? In that case, however, there would be 1,000 (100x10) pictures in one folder; isn't that too many?

A: 

Granted, I have never stored 50,000 images, but I usually just store all images in the same directory and name them in a way that avoids conflicts, then store the reference in the DB.

$parts = explode( '.', $filename );   // explode() returns an array of name parts
$ext = end( $parts );                 // the last part is the extension
$newName = md5( microtime() ) . '.' . $ext;

That way you never end up with two identical filenames, as microtime() will never be the same twice.
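
As a rough sketch of the rest of that flow (the $pdo connection, $userId, the `photos` table, and the upload directory below are illustrative assumptions, not part of the answer):

// $newName computed as above; $pdo is assumed to be an existing PDO connection
move_uploaded_file( $_FILES['photo']['tmp_name'], '/var/www/uploads/' . $newName );

// store the reference in the db (hypothetical `photos` table; $userId assumed)
$stmt = $pdo->prepare( 'INSERT INTO photos (user_id, filename) VALUES (?, ?)' );
$stmt->execute( array( $userId, $newName ) );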

David Morrow
+2  A: 

I would most likely store the images by a hash of their contents: a 128-bit hash, for instance. So I'd rename a user's uploaded image 'foo.jpg' to its content hash (probably base64-encoded, for uniform short names) and then store the user's name for the file and its hash in a database. I'd probably also add a reference count. Then if several people upload the same image, it only gets stored once, and you can delete it when all references vanish.
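
A rough sketch of that approach in PHP, assuming a PDO connection ($pdo), a hypothetical `images` table, and MySQL's ON DUPLICATE KEY UPDATE for the reference count; md5_file() stands in for the 128-bit content hash:

// hash the uploaded file's contents; identical images produce identical names
$hash   = md5_file( $_FILES['photo']['tmp_name'] );
$target = '/var/www/uploads/' . $hash . '.jpg';

// keep only one physical copy per unique content
if ( !file_exists( $target ) ) {
    move_uploaded_file( $_FILES['photo']['tmp_name'], $target );
}

// remember the user's name for the file and bump the reference count
$stmt = $pdo->prepare(
    'INSERT INTO images (hash, original_name, refcount) VALUES (?, ?, 1)
     ON DUPLICATE KEY UPDATE refcount = refcount + 1'
);
$stmt->execute( array( $hash, $_FILES['photo']['name'] ) );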

As for actual physical storage, now that you have a guaranteed uniform naming scheme, you can use your file system as a balanced tree. You can either decide on a maximum number of files per directory and have a balancer move files around to maintain it, or you can work out what a fully populated tree would look like and store your files that way.

The only real drawback to this scheme is that it decouples the stored file names from the users' original names, so a database loss can mean not knowing what any file was called; but you should be careful to back up that kind of information anyway.

swestrup
What would be the advantage of having these in multiple directories as a balanced tree? I remember something about hash-table implementations with multiple buckets from one of my computer science classes, but I can't remember the advantage.
Hortinstein
If you've hashed them, you already have a pretty good guarantee of the distribution of the file names, so balancing is probably not needed. You can just index into a set of dirs.... For example, a file with hash `202cb962ac59075b964b07152d234b70` gets stored in /20/2c/202cb962ac59075b964b07152d234b70.jpg. But doing all this is redoing what the B-Tree structures in the filesystem already do.
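For instance (the $tmpFile variable and paths below are purely illustrative):

// two-level prefix directories derived from the hash, e.g. /20/2c/202cb962...jpg
$hash = md5_file( $tmpFile );
$dir  = '/var/www/uploads/' . substr( $hash, 0, 2 ) . '/' . substr( $hash, 2, 2 );
if ( !is_dir( $dir ) ) {
    mkdir( $dir, 0755, true );    // create the nested dirs on demand
}
rename( $tmpFile, $dir . '/' . $hash . '.jpg' );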
Joe Koberg
@Hortinstein: This kind of advice originated when directory inodes were a linear linked list of the directory contents and had to be walked in order to find any particular filename. To combat that, you put 16 directories in /, then 16 more under each of those, etc... As with any other tree, if your filenames came in order AAAAAAAA, AAAAAAAB, etc., you would be stuffing the "/A/A/A" directory full and leaving the "/Z" directory empty... But this makes little sense in 2010, when filesystems have B-Tree directory structures.
Joe Koberg
The part about "if some folks all upload the same image, it only gets stored once and you can delete it when all references vanish" is definitely a major positive aspect of this scheme, and it is used as a primary mechanism in the Plan 9 `venti` filesystem: http://en.wikipedia.org/wiki/Venti
Joe Koberg
One can start with one directory. When it overflows your MAX_FILES_PER_DIRECTORY, you can then partition it into N subdirectories. When one of those overflows it can be subdivided as well, and so on. Doing this you can (in theory) get pathological cases where some files are nested 10 deep while others are still near the top. You'll need to read each of those directories as you descend the tree looking for a file. A balancer would rearrange the tree so that the fewest possible directories need to be read.
swestrup
A: 

Different filesystems perform differently with directories holding large numbers of files. Some slow down tremendously. Some don't mind at all. For example, IBM JFS2 stores the contents of directory inodes as a B+ Tree sorted by filename.... so it probably provides log(n) access time even in the case of very large directories.

Getting ls or dir to read, sort, get size/date info, and print entries to stdout is a completely different task from accessing a file's contents given its filename.... So don't let the inability of ls to list a huge directory guide you.

Whatever you do, don't optimize too early. Just make sure your file access mechanism can be abstracted (make a FileStorage that you .getfile(id) from, or something...).

That way you can put in whatever directory structure you like, or, if for example you find it's better to store these items as a BLOB column in a database, you have that option...
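
A bare-bones sketch of that kind of abstraction (the names and flat-directory layout are placeholders, not a prescribed API):

// minimal storage abstraction; the on-disk layout can change without touching callers
interface FileStorage {
    public function putFile( $id, $sourcePath );
    public function getFile( $id );    // returns a local path for the stored file
}

class FlatDirectoryStorage implements FileStorage {
    private $baseDir;

    public function __construct( $baseDir ) {
        $this->baseDir = rtrim( $baseDir, '/' );
    }

    public function putFile( $id, $sourcePath ) {
        return copy( $sourcePath, $this->baseDir . '/' . $id );
    }

    public function getFile( $id ) {
        return $this->baseDir . '/' . $id;
    }
}

// callers only see FileStorage, so a hashed-tree or BLOB-backed implementation
// can be swapped in later without changing the rest of the code
$storage = new FlatDirectoryStorage( '/var/www/uploads' );
$path    = $storage->getFile( '202cb962ac59075b964b07152d234b70.jpg' );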

Joe Koberg