views: 9083
answers: 12

Does it matter how many files I keep in a single directory? If so, how many files in a directory is too many, and what are the impacts of having too many files? (This is on a Linux server.)

Background: I have a photo album website, and every image uploaded is renamed to an 8-hex-digit id (say, a58f375c.jpg). This is to avoid filename conflicts (if lots of "IMG0001.JPG" files are uploaded, for example). The original filename and any useful metadata are stored in a database. Right now, I have somewhere around 1500 files in the images directory. This makes listing the files in the directory (through an FTP or SSH client) take a few seconds. But I can't see that it has any effect other than that. In particular, there doesn't seem to be any impact on how quickly an image file is served to the user.

I've thought about reducing the number of images per directory by making 16 subdirectories: 0-9 and a-f. Then I'd move the images into the subdirectories based on the first hex digit of the filename. But I'm not sure there's any reason to do so except for the occasional listing of the directory through FTP/SSH.
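If I did go that route, I imagine the one-time shuffle would look something like this (untested bash sketch; the "images" directory name and .jpg extension are just my setup, and the app would also need to start building URLs with the extra path component):

cd images
mkdir 0 1 2 3 4 5 6 7 8 9 a b c d e f
# move each image into the subdirectory named after its first hex digit
for f in *.jpg; do
    mv "$f" "${f:0:1}/"
done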

+7  A: 

The biggest issue I've run into is that on a 32-bit system, once you pass a certain number of files, tools like 'ls' stop working.

Trying to do anything with that directory once you pass that barrier becomes a huge pain in the ass.

mike
+1  A: 

The question comes down to what you're going to do with the files.

Under Windows, any directory with more than 2k files tends to open slowly for me in Explorer. If they're all image files, more than 1k tend to open very slowly in thumbnail view.

At one time, the system-imposed limit was 32,767. It's higher now, but even that is way too many files to handle at one time under most circumstances.

Jekke
+10  A: 

It depends a bit on the specific filesystem in use on the Linux server. Nowadays the default is ext3 with dir_index, which makes searching large directories very fast.

So speed shouldn't be an issue, other than the one you already noted, which is that listings will take longer.

There is a limit to the total number of files in one directory. I seem to remember it definitely working up to 32000 files.
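You can check whether dir_index is already on, and enable it if it isn't, with something along these lines (just a sketch; /dev/sda1 is a placeholder for the actual partition, and the filesystem should be unmounted before running e2fsck):

tune2fs -l /dev/sda1 | grep dir_index   # is the feature already enabled?
tune2fs -O dir_index /dev/sda1          # turn it on if not
e2fsck -fD /dev/sda1                    # rebuild the directory indexes (run unmounted)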

Bart Schuller
Gnome and KDE load large directories at a snail's pace; Windows will cache the directory so it's reasonable. I love Linux, but KDE and Gnome are poorly written.
Rook
+1  A: 

It really depends on the filesystem used, and also on some flags.

For example, ext3 can hold many thousands of files, but after a couple of thousand it used to be very slow, mostly when listing a directory but also when opening a single file. A few years ago it gained the 'htree' option, which dramatically shortened the time needed to get an inode given a filename.

Personally, I use subdirectories to keep most levels under a thousand or so items. In your case, I'd create 256 dirs, keyed on the last two hex digits of the ID. Use the last rather than the first digits, so the load stays balanced.
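A rough sketch of the one-time move (bash, run inside the images directory; assumes the .jpg extension):

for i in $(seq 0 255); do mkdir -p "$(printf '%02x' "$i")"; done
for f in *.jpg; do
    base="${f%.jpg}"
    mv "$f" "${base: -2}/"   # bucket = last two hex digits of the id
done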

Javier
If the filenames were completely random, it wouldn't matter which digits were used.
strager
Indeed, these filenames are generated randomly.
Kip
+5  A: 

See also this related question: http://stackoverflow.com/questions/446358/storing-a-large-number-of-images

therefromhere
A: 

I recall running a program that was creating a huge number of files as output. The files were sorted into directories of 30,000 each. I do not recall having any read problems when I had to reuse the produced output. It was on a 32-bit Ubuntu Linux laptop, and even Nautilus displayed the directory contents, albeit after a few seconds.

EDIT: Forgot to mention: ext3 filesystem. Similar code on a 64-bit system dealt well with 64,000 files per directory.

+25  A: 

FAT32:
Maximum number of files: 268,435,437
Maximum file size: 4 GB
Maximum number of files per directory: 65,535

NTFS:
Maximum number of files: 4,294,967,295
Maximum file size: 16 TB currently (16 EB theoretically)

Ext2:
Maximum number of files: 10¹⁸
Maximum file size: 2 TB
Theoretical files-per-directory limit: 1.3 × 10²⁰

Ext3:
Maximum number of files: number of bytes in volume / 2¹³
Maximum file size: 16 GB (1 KB block) to 2 TB (4 KB block)

I found this data in several other forums, e.g. this one.

Edit: FAT32 has a limit of 65,535 files per directory. As for non-FAT32 filesystems, I could not find a "hard limit", but the information in this comparison chart is quite informative.

ISW
I assume these are the maximum number of files for the entire partition, not a directory. Thus, this information isn't too useful regarding the problem, because there'd be an equal number of files regardless of the method (unless you count directories as files).
strager
Updated the post concerning files >per directory< :-)
ISW
A: 

If the time involved in implementing a directory partitioning scheme is minimal, I am in favor of it. The first time you have to debug a problem that involves manipulating a 10000-file directory via the console, you will understand.

As an example, F-Spot stores photo files as YYYY\MM\DD\filename.ext, which means the largest directory I have had to deal with while manually manipulating my ~20000-photo collection is about 800 files. This also makes the files more easily browsable from a third party application. Never assume that your software is the only thing that will be accessing your software's files.
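A minimal sketch of filing a new upload under that kind of date-based layout (bash; the "photos" root and the $upload variable are placeholders, not anything F-Spot-specific):

dir="photos/$(date +%Y/%m/%d)"   # e.g. photos/2009/02/17
mkdir -p "$dir"
mv "$upload" "$dir/"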

Sparr
I advise against partitioning by date because bulk imports might cluster files at a certain date.
mdorseif
A good point. You should definitely consider your use cases before picking a partitioning scheme. I happen to import photos over many days in a relatively broad distribution, AND when I want to manipulate the photos outside F-Spot date is the easiest way to find them, so it's a double-win for me.
Sparr
+14  A: 

Keep in mind that on Linux if you have a directory with too many files, the shell may not be able to expand wildcards. I have this issue with a photo album hosted on Linux. It stores all the resized images in a single directory. While the file system can handle many files, the shell can't. Example:

-shell-3.00$ ls A*
-shell: /bin/ls: Argument list too long

or

-shell-3.00$ chmod 644 *jpg
-shell: /bin/chmod: Argument list too long
Steve Kuo
@Steve, use find(1) and/or xargs(1) for these cases. For the same reason it's a good idea to use such tools in scripts instead of command line expansion.
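For instance, something along these lines (just a sketch; same directory and patterns as the examples above):

find . -maxdepth 1 -name 'A*' -ls                              # instead of: ls A*
find . -maxdepth 1 -name '*.jpg' -print0 | xargs -0 chmod 644  # instead of: chmod 644 *jpg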
Dave C
A: 

It absolutely depends on the filesystem. Many modern filesystems use decent data structures to store the contents of directories, but older filesystems often just added the entries to a list, so retrieving a file was an O(n) operation.

Even if the filesystem does it right, it's still absolutely possible for programs that list directory contents to mess up and do an O(n^2) sort, so to be on the safe side, I'd always limit the number of files per directory to no more than 500.

Michael Borgwardt
+1  A: 

I realize this doesn't totally answer your question as to how many is too many, but an idea for solving the long-term problem is that, in addition to storing the original file metadata, you also store which folder on disk it is stored in; normalize out that piece of metadata. Once a folder grows beyond some limit you are comfortable with for performance, aesthetic, or whatever reason, you just create a second folder and start dropping files there...
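A rough sketch of that roll-over idea (bash; the uploads/batch_N naming, the limit of 1000, and $newfile are all made up for illustration; the chosen folder name is what would get recorded in the database next to the rest of the metadata):

MAX=1000
current=$(ls -d uploads/batch_* 2>/dev/null | sort -V | tail -n 1)
# start a new folder if there is none yet or the current one is full
if [ -z "$current" ] || [ "$(ls "$current" | wc -l)" -ge "$MAX" ]; then
    next=$(( $(ls -d uploads/batch_* 2>/dev/null | wc -l) + 1 ))
    current="uploads/batch_$next"
    mkdir -p "$current"
fi
mv "$newfile" "$current/"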

Goyuix
A: 

I'm working on a similar problem right now. We have a hierarchical directory structure and use image IDs as filenames. For example, an image with id=1234567 is placed in

..../45/67/1234567_<...>.jpg

using the last 4 digits to determine where the file goes.

With a few thousand images, you could use a one-level hierarchy. Our sysadmin suggested no more than a couple of thousand files in any given directory (ext3) for efficiency / backup / whatever other reasons he had in mind.
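Roughly how the target directory is derived (bash sketch; the "images" root is an assumption on my part, and IDs shorter than four digits would need zero-padding first):

id=1234567
dir="images/${id: -4:2}/${id: -2}"   # -> images/45/67
mkdir -p "$dir"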

armandino