views:

26

answers:

2

What is the best practice for storing, in the file system, a large number of files that are referenced in a database?

We're currently moving from a system that stores around 14,000 files (around 6GB of images and documents) in a MySQL database. This is quickly becoming unmanageable.

We currently plan to save the files by their database primary key in the file system. I'm concerned about the possible performance issues of having that many files in the same folder. Also, these files will be inserted by several different applications on the same server.

Specifically I'd like to know:

  • Is this a good solution given these parameters?
  • Will it leave room to scale further in the future?
  • Are there any concerns about storage of many files in the same location?
  • Is there a better way to name/distribute the files?
+1  A: 

Hash the contents with MD5, then add a suffix (the PK will suffice for this) to get the file's new filename. Create 16 folders corresponding to the first character of the hash. Create 16 folders under each of those for the second character. Store the image in the appropriate path based on the first 2 hex characters of the hash, then add the hash to the appropriate record in the database.
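A minimal sketch of this scheme in Python (the function name, the `_<pk>` suffix format, and the `files` root directory are illustrative, not prescribed by the answer):

```python
import hashlib
import os

def store_file(data: bytes, pk: int, root: str = "files") -> str:
    """Store `data` under root/<h0>/<h1>/<md5>_<pk> and return the path."""
    digest = hashlib.md5(data).hexdigest()
    # The first two hex characters select one of 16 * 16 = 256 buckets.
    bucket = os.path.join(root, digest[0], digest[1])
    os.makedirs(bucket, exist_ok=True)
    path = os.path.join(bucket, f"{digest}_{pk}")
    with open(path, "wb") as f:
        f.write(data)
    return path  # record `digest` in the database row as well
```

The primary-key suffix keeps filenames unique even if two files happen to have identical contents (and therefore identical hashes).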

Ignacio Vazquez-Abrams
What is the reason to hash the **content**? Why not just hash the PK, `rand()`, or `time()`?
zerkms
@zerkms: No real reason. It's just a convenient place to get it.
Ignacio Vazquez-Abrams
I think that hashing 6GB of images is not really convenient ;-)
zerkms
@zerkms: That's less than 2 DVDs' worth. It's only a couple of hours of "work".
Ignacio Vazquez-Abrams
Then again: why spend a "couple of hours" if this can be done almost instantly by hashing microtime? ;-)
zerkms
@zerkms: Hashing time is a little less reproducible than hashing the contents though.
Ignacio Vazquez-Abrams
I like the concept of hashing the contents. Time isn't really a factor for these files, since the associated metadata is stored in the database. I assume the reason for MD5 is a roughly even distribution? Also, are you aware of the rough performance limits on files per folder?
Mathew Byrne
Yes, the hashing will spread them around in the directory structure. NTFS, ext3 without dir_index, and XFS will take a few seconds with all the files in the same directory. Under the same conditions, ext3/4 with dir_index won't flinch (a fraction of a second). Spreading the files across 256 folders is a compromise. If you believe it's warranted then it's also possible to use 2 hex digits per level, which will split the files into 65,536 buckets.
Ignacio Vazquez-Abrams
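To illustrate the trade-off described in the last comment, here is a small sketch (function and parameter names are illustrative) of taking either one or two hex digits per directory level:

```python
import hashlib

def bucket_path(digest: str, digits_per_level: int = 1, levels: int = 2) -> str:
    """Split the leading hex digits of `digest` into directory levels."""
    parts = [digest[i * digits_per_level:(i + 1) * digits_per_level]
             for i in range(levels)]
    return "/".join(parts)

d = hashlib.md5(b"example").hexdigest()
print(bucket_path(d))      # 1 digit per level  -> 16 * 16     = 256 buckets
print(bucket_path(d, 2))   # 2 digits per level -> 256 * 256   = 65,536 buckets
```

With 14,000 files, 256 buckets already averages out to roughly 55 files per directory, which is well below the point where any of the file systems mentioned above slow down.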
A: 

I like to name files by date:

    /* create directory */
    $dir = date('Y').'/'.date('m').'/'.date('d');

Sam
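The same date-based layout can be sketched in Python, using the database primary key as the filename to keep names unique within a day (the function name and `files` root are illustrative assumptions, not part of the answer):

```python
import os
from datetime import date

def date_path(pk: int, root: str = "files") -> str:
    """Return a path like root/2024/07/09/<pk>, creating directories as needed."""
    today = date.today()
    d = os.path.join(root, f"{today:%Y}", f"{today:%m}", f"{today:%d}")
    os.makedirs(d, exist_ok=True)
    return os.path.join(d, str(pk))
```

Note that date-based directories only bound growth per day; a single very busy day can still put many files in one folder, which is the problem the hash-bucket approach above avoids.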