views: 547
answers: 12
+4  A: 

Whatever you do, don't store them all in one directory.

Depending on the distribution of the names of these images, you could create a directory structure with single-letter top-level folders, each containing another set of subfolders for the 2nd letter of the image name, and so on.

So:

Folder img\a\b\c\d\e\f\g\ would contain the images starting with 'abcdefg' and so on.

You can introduce whatever depth is appropriate for your needs.

The great thing about this solution is that the directory structure effectively acts like a hashtable/dictionary. Given an image file name, you will know its directory and given a directory, you will know a subset of images that go there.
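As a rough illustration of the letter-per-level idea, here is a Python sketch (the `img` root, the depth of 4, and the function name are illustrative choices, not part of the answer):

```python
import os

def letter_path(root, filename, depth=4):
    """Derive a nested directory path from the first `depth`
    characters of the image name, e.g. 'abcdefg.png' -> a/b/c/d."""
    name = os.path.splitext(filename)[0]
    levels = list(name[:depth])
    return os.path.join(root, *levels, filename)

# 'abcdefg.png' ends up under img/a/b/c/d/ (POSIX path separators)
print(letter_path("img", "abcdefg.png"))
```

Given a filename you can compute its directory, and given a directory you know the common prefix of every file in it, which is the hashtable-like property described in the answer.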

Wim Hollebrandse
\a\b\c\d\e\f\ is what i am doing now; i was thinking there might be a wiser way of doing this.
Mike
That's a generally accepted solution for how to physically store them. Clearly, generating the image URLs is something that can easily be done dynamically based on the image file name. Also, to serve them up, you could even introduce img-a, img-b subdomains on the image server if you wanted to, to speed up loading times.
Wim Hollebrandse
Wim - that's exactly what i am doing now, just thought there are some other folks who've hit this problem.
Mike
You might get better distribution by using the last character (or two, or three) rather than the first.
Mark Ransom
@Mark The point is illustrative. It depends on the distribution, as I mentioned.
Wim Hollebrandse
And +1 for "don't store them all in one directory". I'm supporting a legacy system that has put over 47000 files on a server in a single folder, and it takes about a minute for Explorer just to open the folder.
Mark Ransom
Yep. Seen it too. :-o
Wim Hollebrandse
Doing a\b\c\d\e\f\g makes the directory structure very deep, and every directory contains only a few files. Better to use more than one letter per directory level, e.g. ab\cd\ef\ or abc\def\. Directories also take up disk space, so you do not want too many of them.
Juha Syrjälä
It's an illustration - the concept remains the same. It doesn't necessarily make your directory structure deep as it also depends on filename length. Just because I started with a,b,c doesn't mean we need 26 levels.
Wim Hollebrandse
+5  A: 

I would store these on the file system, but it depends on how fast the number of files will grow. Are these files hosted on the web? How many users would access these files? These are questions that need to be answered before I could give you a better recommendation. I would also look at Haystack from Facebook; they have a very good solution for storing and serving up images.

Also, if you choose the file system, you will need to partition these files with directories. I have been looking at this issue and proposed a solution, but it's not a perfect one by any means. I am partitioning by hash table and by users; you can read more on my blog.

Lukasz
the images are not meant for frequent access, so there is no problem with this. their number will grow quite fast; i assume they will hit the 1 mil. mark within 1 month.
Mike
i'm interested in the programmer view so that i don't overthink this too much
Mike
So if you do not need fast access, Haystack is probably not for you. Using directories for partitions is the simplest solution in my view.
Lukasz
+6  A: 

Ideally, you should run some tests on random access times for various structures, as your specific hard drive setup, caching, available memory, etc. can change these results.

Assuming you have control over the filenames, I would partition them at the level of 1000s per directory. The more directory levels you add, the more inodes you burn, so there's a push-pull here.

E.g.,

/root/[0-99]/[0-99]/filename
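A minimal Python sketch of that two-level numeric bucketing (the `/root` prefix, the `.jpg` extension, and the bucket arithmetic are illustrative assumptions, not from the answer):

```python
def bucket_path(image_id):
    """Map a numeric image id to /root/[0-99]/[0-99]/filename.
    Two levels of 100 buckets give 10,000 leaf directories,
    i.e. roughly 1000 files per directory for 10M images."""
    level1 = (image_id // 100) % 100  # second-to-last pair of digits
    level2 = image_id % 100           # last pair of digits
    return f"/root/{level1:02d}/{level2:02d}/{image_id}.jpg"

print(bucket_path(123456))  # → /root/34/56/123456.jpg
```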

Note, http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx has more details on NTFS setup. In particular, "If you use large numbers of files in an NTFS folder (300,000 or more), disable short-file name generation for better performance, and especially if the first six characters of the long file names are similar."

You should also look into disabling filesystem features you don't need (e.g., last access time). http://www.pctools.com/guides/registry/detail/50/

Jason Yanowitz
A: 

I read in your comment that the DB is storing the path. A database, such as SQL Server 2008, might be a good place to actually store the images themselves, especially if retrieval and backup are required.

dove
yes, but this would generate huge db files. it's way easier just to store them in the filesystem and then just copy them to another machine; if load balancing is needed, they will have the same structure, just a different dns.
Mike
Storing images in DB's is highly overrated. Especially with those numbers, you simply do not want to go there. Why all the overhead?
Wim Hollebrandse
Never been a fan of storing binary data in a DB; it seems to be contrary to the point of DBs.
John
Nooooooooooooooooooooo!
Neil N
I put this above, however, SQL 2008 has a new datatype called FILESTREAM. It's easy to work with and the files actually live on the file system.
Chris Lively
-1: see Satanicpuppy's comment: http://stackoverflow.com/questions/1923096/storing-a-million-images/1923211#1923211
Trevor Harrison
Ya probably not the best idea.
Mike
Well, the FILESTREAM idea sounds pretty good, but otherwise it's a definite no-no...
Sudhir Jonathan
A: 

How about a database with a table containing an ID and a BLOB to store the image? Then you can add new table(s) whenever you want to associate more data elements with a photo.

If you're expecting to scale, why not scale now? You'll save time both now and later, IMO. Implement the database layer once, which is fairly easy to start with. Or implement something with folders and filenames and blah blah blah, and later switch to something else when you start blowing up MAX_PATH.

jdmichal
Been there, done that, have the scars to prove it. Databases that store images in large numbers are cranky almost beyond belief, and require inordinate amounts of maintenance. Much better to store them in the file system unless you have a specific need that can only be answered by a database (ours was version tracking.)
Satanicpuppy
And there are lots of utilities to deal with files and file systems, few to none to deal with files within a database.
Mark Ransom
Oh God No. Please dont use a database as large BLOB storage.
Neil N
Eek. Didn't know that databases (still?) have so many problems with BLOBs.
jdmichal
+1  A: 

Perhaps a creation-date-based naming scheme - either including all the info in the file name or (better for browsing later) splitting it up into directories. I can think of the following, depending on how often you generate images:

  • Several images generated each day: Year/Month/Day/Hour_Minute_Second.png
  • A couple a month: Year/Month/Day_Hour_Minute_Second.png

etc. You get my point... =)
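For the several-per-day case, the scheme might be sketched like this in Python (the root folder and extension are my assumptions):

```python
import os
from datetime import datetime

def dated_path(root, taken, ext=".png"):
    """Year/Month/Day/Hour_Minute_Second.png, as in the first bullet."""
    return os.path.join(
        root,
        f"{taken.year:04d}",
        f"{taken.month:02d}",
        f"{taken.day:02d}",
        f"{taken.hour:02d}_{taken.minute:02d}_{taken.second:02d}{ext}",
    )

print(dated_path("img", datetime(2009, 12, 31, 23, 59, 7)))
# → img/2009/12/31/23_59_07.png (POSIX path separators)
```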

Tomas Lycken
they are not continuously generated over time, so some folders will become fat and others stay... slim :))
Mike
Well, you obviously don't have to create *each* folder, just because you're following this scheme. You could even have `Year/Month/Day/Hour/Minute` - decide how many levels of folders you need depending on how often the images are generated *when the rate is the highest* - and then just don't create folders that would be left empty.
Tomas Lycken
+4  A: 

I'm going to put my 2 cents worth in on a piece of negative advice: Don't go with a database.

I've been working with image storing databases for years: large (1 meg->1 gig) files, often changed, multiple versions of the file, accessed reasonably often. The database issues you run into with large files being stored are extremely tedious to deal with, writing and transaction issues are knotty and you run into locking problems that can cause major train wrecks. I have more practice in writing dbcc scripts, and restoring tables from backups than any normal person should ever have.

Most of the newer systems I've worked with have pushed the file storage to the file system, and relied on databases for nothing more than indexing. File systems are designed to take that sort of abuse, they're much easier to expand, and you seldom lose the whole file system if one entry gets corrupted.

Satanicpuppy
yes. note taken !
Mike
Have you looked at SQL 2008's FILESTREAM data type? It's a cross between database and file system storage.
Chris Lively
+1 on sticking with file server rather than a database as you are doing fast and infrequent IO operations.
Jay Zeng
+1  A: 

Quick point: you don't need to store a file path in your DB. You can just store a numeric value, if your files are named in the way you describe. Then, using one of the well-defined storage schemes already discussed, you can take the index as a number and very quickly find the file by traversing the directory structure.

John
:-? good quick point. just that now i don't have an algorithm for generating the path.
Mike
+2  A: 

The new MS SQL Server 2008 has a feature to handle such cases: it's called FILESTREAM. Take a look:

Microsoft TechNet FILESTREAM Overview

Padu Merloti
+5  A: 

I'd recommend using a regular file system instead of a database. Using the file system is easier than a database: you can use normal tools to access the files, and file systems are designed for this kind of usage. NTFS should work just fine as a storage system.

Do not store the actual path in the database. It is better to store the image's sequence number in the database and have a function that can generate the path from the sequence number, e.g.:

 File path = generatePathFromSequenceNumber(sequenceNumber);

This is easier to handle if you need to change the directory structure somehow. Maybe you need to move the images to a different location, maybe you run out of space and start storing some of the images on disk A and some on disk B, etc. It is easier to change one function than to change paths in the database.

I would use this kind of algorithm for generating the directory structure:

  1. First pad your sequence number with leading zeroes until you have a string of at least 12 digits. This is the name of your file. You may want to add a suffix:
    • 12345 -> 000000012345.jpg
  2. Then split the string into 2- or 3-character blocks, where each block denotes a directory level. Use a fixed number of directory levels (for example 3):
    • 000000012345 -> 000/000/012
  3. Store the file under the generated directory:
    • Thus the full path and filename for the file with sequence id 12345 is 000/000/012/000000012345.jpg
    • For file with sequence id 12345678901234 the path would be 123/456/789/12345678901234.jpg

Some things to consider about directory structures and file storage:

  • The above algorithm gives you a system where every leaf directory contains a maximum of 1000 files (as long as you have fewer than 1 000 000 000 000 files in total)
  • There may be limits on how many files and subdirectories a directory can contain; for example, the ext3 file system on Linux has a limit of 31998 subdirectories per directory.
  • Normal tools (WinZip, Windows Explorer, command line, bash shell, etc.) may not work very well if you have a large number of files per directory (> 1000)
  • The directory structure itself will take some disk space, so you do not want too many directories.
  • With the above structure you can always find the correct path for an image file just by looking at the filename, even if you happen to mess up your directory structures.
  • If you need to access the files from several machines, consider sharing the files via a network file system.
  • The above directory structure will not work if you delete a lot of files; it leaves "holes" in the directory structure. But since you are not deleting any files, it should be OK.
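The algorithm above can be sketched in Python (the function mirrors the generatePathFromSequenceNumber pseudocode; the 12-digit padding, 3-character blocks, and 3 levels are the values from the example):

```python
import os

def generate_path_from_sequence_number(seq, suffix=".jpg"):
    """Pad the id to at least 12 digits, use the first 9 digits as
    three directory levels, and the full padded id as the filename."""
    name = str(seq).zfill(12)                    # 12345 -> '000000012345'
    dirs = [name[0:3], name[3:6], name[6:9]]     # '000', '000', '012'
    return os.path.join(*dirs, name + suffix)

print(generate_path_from_sequence_number(12345))
# → 000/000/012/000000012345.jpg (POSIX path separators)
print(generate_path_from_sequence_number(12345678901234))
# → 123/456/789/12345678901234.jpg
```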
Juha Syrjälä
very interesting! splitting the filename... i didn't think of that. i assume this is the elegant way of doing it :-?
Mike
Using a hash (such as MD5) as the name of the file, as well as the directory distribution, would work. Not only would the integrity of the files be a side benefit to the naming scheme (easily checked), but you'll have a reasonably even distribution throughout the directory hierarchy. So if you have a file named "f6a5b1236dbba1647257cc4646308326.jpg" you'd store it in "/f/6" (or as deep as you require). 2 levels deep gives 256 directories, or just under 4000 files per directory for the initial 1m files. It would also be very easy to automate the redistribution to a deeper scheme.
Geoff Fritz
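A sketch of the hash-based variant from the comment above (Python; MD5 is used purely as a distribution hash, and the two-level depth, root folder, and payload are illustrative):

```python
import hashlib
import os

def hashed_path(root, image_bytes, depth=2, ext=".jpg"):
    """Name the file after its MD5 digest and bucket it by the first
    `depth` hex characters, e.g. f6a5b1... -> root/f/6/f6a5b1....jpg.
    Re-hashing the stored file later gives a cheap integrity check."""
    digest = hashlib.md5(image_bytes).hexdigest()
    buckets = list(digest[:depth])
    return os.path.join(root, *buckets, digest + ext)

path = hashed_path("img", b"example image bytes")  # hypothetical payload
```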
+2  A: 

Will your images need to be named uniquely? Can the process that generates these images produce the same filename more than once? It's hard to say without knowing what device is creating the filenames, but suppose that device is 'reset' and upon restart begins naming the images as it did the last time it was 'reset' - if that is a concern..

Also, you say that you will hit 1 million images in one month's time. How about after that? How fast will these images continue to fill the file system? Will they top off at some point and level out at about 1 million TOTAL images, or will the count continue to grow, month after month?

I ask because you could begin designing your file system by month, then by image. I might be inclined to suggest that you store the images in such a directory structure:

imgs\yyyy\mm\filename.ext

where: yyyy = 4 digit year
         mm = 2 digit month

example:  D:\imgs\2009\12\aaa0001.jpg
          D:\imgs\2009\12\aaa0002.jpg
          D:\imgs\2009\12\aaa0003.jpg
          D:\imgs\2009\12\aaa0004.jpg
                   |
          D:\imgs\2009\12\zzz9982.jpg
          D:\imgs\2010\01\aaa0001.jpg (this is why I ask about uniqueness)
          D:\imgs\2010\01\aaa0001.jpg

Month, year, even day is good for security type images. Not sure if this is what you are doing but I did that with a home security camera that snapped a photo every 10 seconds... This way your application can drill down to specific time or even a range where you might think the image was generated. Or, instead of year, month - is there some other "meaning" that can be derived from the image file itself? Some other descriptors, other than the date example I gave?

I would not store the binary data in the DB. I've never had good performance / luck with that sort of thing, and can't imagine it working well with 1 million images. I would store the filename and that is it. If they are all going to be JPGs, then don't even store the extension. I would create a control table that stores a pointer to the file's server, drive, path, etc. That way you can move those images to another box and still locate them. Do you have a need to keyword-tag your images? If so, then you would want to build the appropriate tables that allow that sort of tagging.

You / others may have addressed these ideas while I was replying.. Hope this helps..

Optimal Solutions
1. all files will be named uniquely. 2. the system will grow and grow; at first it will get to around 1 mil images, and then grow at a rate of a couple tens of thousands per month. 3. there will be some sort of tagging of the files at some point in the future; that's why i want to store some sort of identification data in the db.
Mike
A: 

If you are on Windows, how about an exFAT filesystem?

http://msdn.microsoft.com/en-us/library/aa914353.aspx

it was designed with storing media files in mind, and is available now.

Alex