views: 63

answers: 3
Hi, I'm developing a website which might grow to a few thousand users, each of whom could upload up to ten pictures to the server. I'm wondering what would be the best way of storing the pictures. Let's assume I have 5,000 users with 10 pictures each, which gives us 50,000 pics. (I guess it wouldn't be a good idea to store them in the database as blobs ;) )

Would it be a good idea to dynamically create a directory for every 100 registered users (50 dirs in total, assuming 5,000 users) and upload their pictures there? Would the naming convention 'xxx_yy.jpg' (xxx being the user id and yy the picture number) be OK? In that case, however, there would be 1,000 (100x10) pictures in one folder; isn't that too many?

A: 

Granted, I have never stored 50,000 images, but I usually just store all images in the same directory and name them in a way that avoids conflicts, then store the reference in the DB.

$parts = explode( '.', $filename );   // explode() returns an array of name parts
$ext = end( $parts );                 // the last part is the extension
$newName = md5( microtime() ) . '.' . $ext;

That way you never end up with two identical filenames, as microtime() will never be the same twice.
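
As a rough sketch of the rest of that flow (the $pdo connection, $userId, the `photos` table, and the upload directory below are illustrative assumptions, not part of the answer):

// $newName computed as above; $pdo is assumed to be an existing PDO connection
move_uploaded_file( $_FILES['photo']['tmp_name'], '/var/www/uploads/' . $newName );

// store the reference in the db (hypothetical `photos` table; $userId assumed)
$stmt = $pdo->prepare( 'INSERT INTO photos (user_id, filename) VALUES (?, ?)' );
$stmt->execute( array( $userId, $newName ) );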

David Morrow
+2  A: 

I would most likely store the images by a hash of their contents: a 128-bit hash, for instance. So I'd rename a user's uploaded image 'foo.jpg' to its content hash (probably base64-encoded, for uniform short names) and then store the user's name for the file and its hash in a database. I'd probably also add a reference count. Then if several people upload the same image, it only gets stored once, and you can delete it when all references vanish.
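
A rough sketch of that approach in PHP, assuming a PDO connection ($pdo), a hypothetical `images` table, and MySQL's ON DUPLICATE KEY UPDATE for the reference count; md5_file() stands in for the 128-bit content hash:

// hash the uploaded file's contents; identical images produce identical names
$hash   = md5_file( $_FILES['photo']['tmp_name'] );
$target = '/var/www/uploads/' . $hash . '.jpg';

// keep only one physical copy per unique content
if ( !file_exists( $target ) ) {
    move_uploaded_file( $_FILES['photo']['tmp_name'], $target );
}

// remember the user's name for the file and bump the reference count
$stmt = $pdo->prepare(
    'INSERT INTO images (hash, original_name, refcount) VALUES (?, ?, 1)
     ON DUPLICATE KEY UPDATE refcount = refcount + 1'
);
$stmt->execute( array( $hash, $_FILES['photo']['name'] ) );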

As for actual physical storage, now that you have a guaranteed uniform naming scheme, you can use your file system as a balanced tree. You can either decide on a maximum number of files per directory and have a balancer move files around to maintain it, or you can work out what a fully populated tree would look like and store your files that way.

The only real drawback to this scheme is that it decouples the stored file names from the users' original names, so a database loss can mean not knowing what any file was called; but you should be careful to back up that kind of information anyway.

swestrup
What would be the advantage of having these in multiple directories as a balanced tree? I remember something about hash-table implementations with multiple buckets from one of my computer science classes, but I can't remember the advantage.
Hortinstein
If you've hashed them, you already have a pretty good guarantee of the distribution of the file names, so balancing is probably not needed. You can just index into a set of dirs.... For example, a file with hash `202cb962ac59075b964b07152d234b70` gets stored in /20/2c/202cb962ac59075b964b07152d234b70.jpg. But doing all this is redoing what the B-Tree structures in the filesystem already do.
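For instance (the $tmpFile variable and paths below are purely illustrative):

// two-level prefix directories derived from the hash, e.g. /20/2c/202cb962...jpg
$hash = md5_file( $tmpFile );
$dir  = '/var/www/uploads/' . substr( $hash, 0, 2 ) . '/' . substr( $hash, 2, 2 );
if ( !is_dir( $dir ) ) {
    mkdir( $dir, 0755, true );    // create the nested dirs on demand
}
rename( $tmpFile, $dir . '/' . $hash . '.jpg' );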
Joe Koberg
@Hortinstein: This kind of advice originated when directory inodes were a linear linked list of the directory contents and had to be walked in order to find any particular filename. To combat that, you put 16 directories in /, then 16 more under each of those, etc... As with any other tree, if your filenames came in order AAAAAAAA, AAAAAAAB, etc., you would be stuffing the "/A/A/A" directory full and leaving the "/Z" directory empty... But this makes little sense in 2010, when filesystems have B-Tree directory structures.
Joe Koberg
The part about "if some folks all upload the same image, it only gets stored once and you can delete it when all references vanish" is definitely a major positive aspect of this scheme, and it is used as a primary mechanism in the Plan 9 `venti` filesystem: http://en.wikipedia.org/wiki/Venti
Joe Koberg
One can start with one directory. When it overflows your MAX_FILES_PER_DIRECTORY, you can then partition it into N subdirectories. When one of those overflows it can be subdivided as well, and so on. Doing this you can (in theory) get pathological cases where some files are nested 10 deep while others are still near the top. You'll need to read each of those directories as you descend the tree looking for a file. A balancer would rearrange the tree so that the fewest possible directories need to be read.
swestrup
A: 

Different filesystems perform differently with directories holding large numbers of files. Some slow down tremendously. Some don't mind at all. For example, IBM JFS2 stores the contents of directory inodes as a B+ Tree sorted by filename.... so it probably provides log(n) access time even in the case of very large directories.

Getting ls or dir to read, sort, get size/date info, and print entries to stdout is a completely different task from accessing a file's contents given its filename.... So don't let the inability of ls to list a huge directory guide you.

Whatever you do, don't optimize too early. Just make sure your file access mechanism can be abstracted (make a FileStorage that you .getfile(id) from, or something...).

That way you can put in whatever directory structure you like, or, if for example you find it's better to store these items as a BLOB column in a database, you have that option...
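
A bare-bones sketch of that kind of abstraction (the names and flat-directory layout are placeholders, not a prescribed API):

// minimal storage abstraction; the on-disk layout can change without touching callers
interface FileStorage {
    public function putFile( $id, $sourcePath );
    public function getFile( $id );    // returns a local path for the stored file
}

class FlatDirectoryStorage implements FileStorage {
    private $baseDir;

    public function __construct( $baseDir ) {
        $this->baseDir = rtrim( $baseDir, '/' );
    }

    public function putFile( $id, $sourcePath ) {
        return copy( $sourcePath, $this->baseDir . '/' . $id );
    }

    public function getFile( $id ) {
        return $this->baseDir . '/' . $id;
    }
}

// callers only see FileStorage, so a hashed-tree or BLOB-backed implementation
// can be swapped in later without changing the rest of the code
$storage = new FlatDirectoryStorage( '/var/www/uploads' );
$path    = $storage->getFile( '202cb962ac59075b964b07152d234b70.jpg' );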

Joe Koberg