I'm thinking about developing my own PHP-based gallery for storing lots of pictures, maybe tens of thousands.

In the database I'll store the URL of each image, but here's the problem: I know it's impractical to have all of them sitting in the same directory on the server, as it would slow access to a crawl. So how would you store them all? Some kind of tree based on the name of the JPEG/PNG?

What rules for partitioning the images would you recommend?

(It will be aimed at cheapo shared hosts, so no tinkering with the server is possible.)

A: 

Use the hierarchy of the file system. IDing your images with paths like 001/002/003/004.jpg would be very helpful. Partitioning is a different story, though: it could be random, content-based, creation-date-based, etc. It really depends on what your application is.

PolyThinker
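A minimal PHP sketch of that idea, assuming a zero-padded numeric id split into 3-digit path segments (the function name and padding width are illustrative, not from the answer):

function idToPath($id, $extension = 'jpg') {
    // Zero-pad the id to a fixed width, then split into 3-digit segments:
    // 1234 -> "000000001234" -> 000/000/001/234.jpg
    $padded = sprintf('%012d', $id);
    $parts  = str_split($padded, 3);
    return implode('/', $parts) . '.' . $extension;
}

// idToPath(1234) => "000/000/001/234.jpg"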
A: 

You may check out the strategy used by the Apple iPod for storing its multimedia content. There are folders at one level of depth, and files with names of the same width. I believe the Apple guys invested a lot of time in testing their solution, so it may bring some instant benefit to you.

Boris Pavlović
I'm not too clear what you mean here. Can you give an example?
rikh
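For what it's worth, a guess at that layout in PHP, assuming a single level of numbered folders and fixed-width random file names (the folder count and name width are assumptions, and a real application would also check for name collisions):

function ipodStylePath($extension = 'jpg') {
    // One level of folders (F00..F49) and 4-character fixed-width names,
    // roughly like the iPod's music storage layout.
    $folder = sprintf('F%02d', mt_rand(0, 49));
    $name   = strtoupper(substr(md5(uniqid('', true)), 0, 4));
    return $folder . '/' . $name . '.' . $extension;
}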
+18  A: 

We had a similar problem in the past, and found a nice solution:

  • Give each image a unique GUID.
  • Create a database record for each image containing the name, location, GUID, and possible locations of sub-images (thumbnails, reduced size, etc.).
  • Use the first (one or two) characters of the GUID to determine the top-level folder.
  • If the folders get too many files, split again. Update the references and you are ready to go.
  • If the number of files and accesses gets too high, you can spread the folders over different file servers.

We have found that using GUIDs gives you a more or less uniform distribution. And it worked like a charm (sketched in the code below).

Links which might help to generate a unique ID:

Gamecat
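A minimal PHP sketch of the GUID scheme above, with random_bytes() standing in for a real GUID generator (function names and the base directory are illustrative):

// 32 hex characters, uniformly distributed because the bytes are random.
function newImageGuid() {
    return bin2hex(random_bytes(16));
}

// The first two characters pick one of 256 top-level folders; split
// again on the next characters if a folder grows too large.
function guidToFolder($guid, $baseDir = '/var/images') {
    return $baseDir . '/' . substr($guid, 0, 2);
}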
If you use a database anyway, why not just make it a blob and let the database worry about it?
roe
Because of performance: database calls are usually really expensive, especially for binary data like images.
Mike Geise
Not to mention that serving images out of the database means you pretty much always send the data, whereas if you serve from the file system you can let the browser/server handle the caching of images.
MikeJ
A: 

If the pictures you're handling are digital photographs, you could use EXIF data to sort them, for example by capture date.

Keltia
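A minimal PHP sketch of the EXIF idea, assuming the exif extension is available and falling back to the file's mtime when there is no capture date (all names are illustrative):

function captureDateFolder($file) {
    $exif = @exif_read_data($file);
    if ($exif !== false && !empty($exif['DateTimeOriginal'])) {
        // EXIF dates look like "2009:01:31 14:03:07"; fix the date part
        // so strtotime() can parse it.
        $date = str_replace(':', '-', substr($exif['DateTimeOriginal'], 0, 10));
        $time = strtotime($date);
    } else {
        $time = filemtime($file);
    }
    return date('Y/m', $time); // e.g. "2009/01"
}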
A: 

You can store the images in the database as blobs (varbinary for MS SQL). That way you don't have to worry about the storage or directory structure. The only downside is that you can't easily browse the files, but that would be hard in a balanced directory tree anyway.

Mats Fredriksson
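A minimal sketch of the blob approach with PDO; the DSN, table, and column names are assumptions:

// Store the raw image bytes in a BLOB / varbinary(max) column.
$pdo  = new PDO('mysql:host=localhost;dbname=gallery', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO images (name, data) VALUES (?, ?)');
$stmt->bindValue(1, 'photo.jpg');
$stmt->bindValue(2, file_get_contents('photo.jpg'), PDO::PARAM_LOB);
$stmt->execute();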
A: 

You could always have a DateTime column in the table and then store the images in folders named after the year/month, or even year/month/day, they were added to the table.

Example

2009
  01
    01
    02
    03
    31

This way you end up no more than 3 folders deep (see the sketch below).

Mike Geise
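A minimal PHP sketch of this date-based layout, creating the day's folder on demand (the base directory is illustrative):

function todaysFolder($baseDir = '/var/images') {
    // e.g. /var/images/2009/01/31 -- never more than 3 folders deep.
    $folder = $baseDir . '/' . date('Y/m/d');
    if (!is_dir($folder)) {
        mkdir($folder, 0755, true); // third argument creates parents recursively
    }
    return $folder;
}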
+6  A: 

I usually just use the numerical database id (auto_increment) and then use the modulo (%) operator to figure out where to put the file. Simple and scalable. For instance, the path to the image with id 12345 could be created like this:

12345 % 100 = 45
12345 % 1000 = 345

Ends up in:

/home/joe/images/345/45/12345.png

Or something like that.

If you're using Linux and the ext3 filesystem, be aware that there are limits to the number of directories and files you can have in a directory. The limit is 32000 subdirectories per directory, so you should always strive to keep the number of dirs low.

Martin Wickman
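A minimal PHP sketch of this modulo scheme, reproducing the example path above (the function name and base directory are illustrative):

function moduloPath($id, $baseDir = '/home/joe/images') {
    $level1 = $id % 1000; // 12345 % 1000 = 345
    $level2 = $id % 100;  // 12345 % 100  = 45
    return sprintf('%s/%d/%d/%d.png', $baseDir, $level1, $level2, $id);
}

// moduloPath(12345) => "/home/joe/images/345/45/12345.png"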
+3  A: 

I worked on an Electronic Document Management system a few years ago, and we did pretty much what Gamecat and wic suggested.

That is, assign each image a unique ID, and use that to derive a relative path to the image file. We used MOD, similar to what wic suggested, but we allowed 1024 folders/files at each level, with 3 levels, so we could support 1G files.

We stripped the extension off the files, however. The DB records contained the MIME type, so the extension was not needed.

I would not recommend storing the full URL in the DB record, only the Image ID. If you store the URL you can't move or restructure your storage without converting your DB. A relative URL would be ok since that way you can at least move the image repository around, but you'll get more flexibility if you just store the ID and derive the URL.

Also, I would not recommend allowing direct references to your image files from the web. Instead, provide a URL to a server-side program (e.g., Java Servlet), with the Image ID being supplied in the URL Query (http://url.com/GetImage?imageID=1234).

The servlet can use that ID to look up the DB record, determine MIME Type, derive the actual location, check for security restrictions, logging, etc.

Clayton
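A rough PHP counterpart to the servlet idea (the URL becomes something like image.php?imageID=1234); the table, columns, and path scheme here are assumptions:

$id = (int) ($_GET['imageID'] ?? 0);

// Look up the MIME type; unknown ids never reveal a file path.
$pdo  = new PDO('mysql:host=localhost;dbname=gallery', 'user', 'pass');
$stmt = $pdo->prepare('SELECT mime_type FROM images WHERE id = ?');
$stmt->execute([$id]);
$mime = $stmt->fetchColumn();

if ($mime === false) {
    http_response_code(404);
    exit;
}

// Derive the real location from the id (modulo-style folders, no extension
// on disk since the MIME type lives in the DB).
$path = sprintf('/var/images/%d/%d/%d', $id % 1000, $id % 100, $id);

header('Content-Type: ' . $mime);
readfile($path);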
Good points. Does the servlet request still allow for caching? I am looking at a similar problem, but in my app the transfer time is critical, so I was looking for ways to cache the images on the client. Am I dreaming?
MikeJ
@MikeJ: You could create a separate class for access to the images. That class would know how to derive a path from an id, etc. It could also contain a cache, possibly as a hashtable that you manage yourself, or maybe a canned cache class. Servlet would get images from this object, not from disk.
Clayton
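On the caching question: a sketch of how such a script can still let clients cache, by sending Last-Modified and answering conditional requests with 304 (the path and max-age are illustrative):

$path  = '/var/images/345/45/12345.png'; // derived from the id as above
$mtime = filemtime($path);

// If the browser's cached copy is still current, skip the body entirely.
if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])
        && strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']) >= $mtime) {
    http_response_code(304);
    exit;
}

header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $mtime) . ' GMT');
header('Cache-Control: public, max-age=86400'); // let clients reuse for a day
header('Content-Type: image/png');
readfile($path);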
A: 

Look at the XFS filesystem. It supports an unlimited number of files, and Linux supports it. http://oss.sgi.com/projects/xfs/papers/xfs%5Fusenix/index.html

EXTROMEDIA
+1  A: 

When saving files associated with auto_increment ids, I use something like the following, which creates three directory levels, each holding up to 1000 dirs, with up to 100 files in each third-level directory. This supports ~100 billion files.

If $id = 99532455444, the following returns /995/324/554/44:

function getFileDirectory($id) {
    // Split the id into three 3-digit directory levels plus the file part:
    // 99532455444 -> 995 / 324 / 554 / 44
    $level1 = intdiv($id, 100000000);     // drop the last 8 digits
    $level2 = intdiv($id, 100000) % 1000; // next 3 digits
    $level3 = intdiv($id, 100) % 1000;    // next 3 digits
    $file   = $id % 100;                  // last 2 digits name the file

    return '/' . sprintf('%03d', $level1)
         . '/' . sprintf('%03d', $level2)
         . '/' . sprintf('%03d', $level3)
         . '/' . $file;
}
Isaac
A: 

I know it's impractical to have all of them sitting in the same directory on the server, as it would slow access to a crawl.

This is an assumption.

I have designed systems where we had millions of files stored flat in one directory, and it worked great. It's also the easiest system to program. Most server filesystems support this without a problem (although you'd have to check which one you were using).

http://www.databasesandlife.com/flat-directories/

Adrian Smith
A: 

@Adrian Smith

I totally agree with you, but it's hard to BACK UP these monster directories! All self-built or "professional" solutions (rsync) failed totally; even a step-by-step, one-file-after-another script killed my CPUs (and my live high-traffic server). It's the worst thing ever when you realize that your free-time project keeps getting bigger and you don't know how to handle these masses without going offline for days.

Chris
A: 

I have to add something:

There's a very good PHP script out there that handles exactly this problem: http://cachemogul.mawhorter.net/

It creates "high-performance" directory structures and filenames; all you have to do is hand the filename to the class. Cool! And it's free.

Chris