views: 141
answers: 5
We're creating an ASP.NET MVC site that will need to store over 1 million pictures, each around 2 KB-5 KB in size. From previous research, it looks like a file server is probably better than a DB (feel free to comment otherwise).

Is there anything special to consider when storing this many files? Are there any issues with Windows being able to find a photo quickly when there are so many files in one folder? Does a segmented directory structure need to be created, for example dividing them up by filename? It would be nice if the solution scaled to at least 10 million pictures for potential future expansion.

+3  A: 

4 KB is the default cluster size for NTFS. You might tune this setting depending on the typical picture size: http://support.microsoft.com/kb/314878

I would build a tree of subdirectories so the data can be moved from one filesystem to another (http://stackoverflow.com/questions/466521/how-many-files-in-a-directory-is-too-many) and to avoid some issues: http://www.frank4dd.com/howto/various/maxfiles-per-dir.htm

You can also bundle associated pictures into archives so they can be loaded with a single file open. Those archives might be compressed if the bottleneck is I/O, or uncompressed if it's CPU.
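
A minimal Python sketch of that idea, assuming pictures that belong together are grouped per archive (the paths and function names are made up for illustration); stored vs. deflated maps to the I/O vs. CPU trade-off:

    import zipfile

    # Hypothetical helper: bundle one group's pictures into a single archive so a
    # page load needs only one file open.
    def bundle_pictures(picture_paths, archive_path, compress=False):
        # ZIP_STORED avoids CPU cost; ZIP_DEFLATED trades CPU for less I/O.
        method = zipfile.ZIP_DEFLATED if compress else zipfile.ZIP_STORED
        with zipfile.ZipFile(archive_path, "w", compression=method) as archive:
            for path in picture_paths:
                archive.write(path)

    def read_picture(archive_path, member_name):
        # One open() on the archive, then read any member without touching other files.
        with zipfile.ZipFile(archive_path, "r") as archive:
            return archive.read(member_name)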

A DB is easier to maintain but slower... so it's up to you!

Guillaume
+1  A: 

Assuming NTFS, there is a limit of 4 billion files per volume (2^32 - 1). That's the total limit for all the folders on the volume (including operating system files etc.)

Large numbers of files in a single folder should not be a problem; NTFS uses a B+ tree for fast retrieval. Microsoft recommends that you disable short-file name generation (the feature that allows you to retrieve mypictureofyou.html as mypic~1.htm).

I don't know if there's any performance advantage to segmenting them into multiple directories; my guess is that there would not be an advantage, because NTFS was designed for performance with large directories.

If you do decide to segment them into multiple directories, use a hash function on the file name to get the directory name (rather than the directory name being the first letter of the file name for instance) so that each subdirectory has roughly the same number of files.
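
A possible sketch of that hashing approach in Python, assuming a made-up root path and two hex characters per directory level (256 * 256 = 65,536 buckets, roughly 150 files each at 10 million pictures):

    import hashlib
    import os

    # Hypothetical helper: derive a two-level bucket from a hash of the file name
    # so files spread evenly across subdirectories regardless of naming patterns.
    def bucketed_path(root, file_name):
        digest = hashlib.md5(file_name.encode("utf-8")).hexdigest()
        return os.path.join(root, digest[:2], digest[2:4], file_name)

    def store(root, file_name, data):
        path = bucketed_path(root, file_name)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)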

Mark Lutton
While code might be able to read a file in a directory with a very large number of files, it still isn't a great idea. If you've ever tried to open a directory with several thousand files in Explorer, you know it is very slow. Hashing into sub-directories helps a lot with that.
Kleinux
The slowness in Explorer is probably due more to what Explorer is trying to do with all those file names rather than retrieving the file names themselves. It will take a long time to read all the files and show thumbnails for instance. Retrieving an individual file if you already know the filename should be fast. If you write your own system for storing and retrieving files, you might or might not get better performance than NTFS.
Mark Lutton
+1  A: 

I wouldn't rule out using a content delivery network. They are designed for this problem. I've had a lot of success with Amazon S3. Since you are using a Microsoft-based solution, perhaps Azure would be a good fit.

Is there some sort of requirement that prevents you from using a third-party solution?

Doug R
+2  A: 

See also this Server Fault question for some discussion about directory structures.

Juha Syrjälä
+1  A: 

The problem is not that the filesystem cannot store that many files in one directory, but that browsing such a directory in Windows Explorer takes forever. So if you will ever need to access that folder manually, you should segment it, for example with a directory per the first 2-3 letters/numbers of the file name, or even a deeper structure.

Dividing them into 1,000 folders with 1,000 files each would be more than enough, and the code to do that is quite simple; a sketch follows.
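
One way to do that in Python, assuming pictures are keyed by a numeric id (the root path and extension are placeholders):

    import os

    # Hypothetical helper: id // 1000 picks one of ~1,000 folders per million
    # pictures, keeping each folder at roughly 1,000 files.
    def picture_path(root, picture_id, extension=".jpg"):
        folder = str(picture_id // 1000)
        return os.path.join(root, folder, f"{picture_id}{extension}")

    # e.g. picture_path("/pictures", 1234567) -> /pictures/1234/1234567.jpg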

Marc Climent