tags:
views: 998
answers: 6

If there are around 1,000,000 individual files (mostly 100k in size) flat in a single directory (no other directories or files inside them), are there going to be any compromises in efficiency, or disadvantages in any other way?

+4  A: 

ARG_MAX is going to take issue with that... for instance, rm -rf * (while in the directory) is going to fail with "Argument list too long". Utilities that want to do some kind of globbing (or a shell) will have some functionality break.

If that directory is available to the public (let's say via FTP or a web server) you may encounter additional problems.

The effect on any given file system depends entirely on that file system. How frequently are these files accessed, what is the file system? Remember, Linux (by default) prefers keeping recently accessed files in memory while putting processes into swap, depending on your settings. Is this directory served via http? Is Google going to see and crawl it? If so, you might need to adjust VFS cache pressure and swappiness.
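A minimal sketch of those two tunables; the values shown are purely illustrative, not recommendations for any particular workload:

```shell
# Show the current settings (readable without root):
cat /proc/sys/vm/vfs_cache_pressure
cat /proc/sys/vm/swappiness

# Illustrative changes (as root) -- tune for your own workload:
# sysctl vm.vfs_cache_pressure=50   # keep dentry/inode caches longer
# sysctl vm.swappiness=10           # be more reluctant to swap processes
```

Lower vfs_cache_pressure makes the kernel hold on to directory-entry and inode caches, which matters when a huge directory is hit repeatedly.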

Edit:

ARG_MAX is a system-wide limit on the combined size of the arguments (and environment) that can be passed to a program when it is executed. So, let's take 'rm' and the example "rm -rf *" - the shell is going to expand '*' into a list of every file name in the directory, which in turn becomes the argument list for 'rm'.

The same thing is going to happen with ls, and several other tools. For instance, ls foo* might break if too many files start with 'foo'.
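One common workaround is to let find(1) do the matching and the work itself, so the shell never has to hand a million-entry argument list to an external program. A sketch (the file names here are invented for the demonstration):

```shell
# Throwaway directory with many 'foo*' files for the demonstration:
demo=$(mktemp -d)
cd "$demo"
for i in $(seq 1 1000); do : > "foo$i"; done

# Instead of: rm foo*   (can fail with "Argument list too long")
find . -maxdepth 1 -type f -name 'foo*' -delete

ls | wc -l   # prints 0
```

For commands find can't run itself, find ... -print0 | xargs -0 cmd batches the names into argument lists that stay under the limit.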

I'd advise (no matter what fs is in use) to break it up into smaller directory chunks, just for that reason alone.

Tim Post
+2  A: 

When you accidentally execute "ls" in that directory, or use tab completion, or want to execute "rm *", you'll be in big trouble. In addition, there may be performance issues depending on your file system.

It's considered good practice to group your files into directories which are named by the first 2 or 3 characters of the filenames, e.g.

aaa/
   aaavnj78t93ufjw4390
   aaavoj78trewrwrwrwenjk983
   aaaz84390842092njk423
   ...
abc/
   abckhr89032423
   abcnjjkth29085242nw
   ...
...
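The layout above can be produced with a short shell loop. A sketch, assuming the flat directory's path is /path/to/flatdir and that file names are at least 3 characters long (the prefix length is an arbitrary choice):

```shell
# Shard a flat directory into subdirectories keyed on the
# first 3 characters of each filename.
cd /path/to/flatdir
for f in *; do
    [ -f "$f" ] || continue                 # skip anything that isn't a regular file
    prefix=$(printf '%s' "$f" | cut -c1-3)  # first 3 chars of the name
    mkdir -p "$prefix"
    mv -- "$f" "$prefix/"
done
```

Note that the * glob is expanded inside the shell itself, so it does not hit ARG_MAX (that limit only applies to arguments passed to an external program), though expanding a million names is still slow and memory-hungry.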
vog
+2  A: 

Most distros use ext3 by default, which can use b-tree indexing for large directories. Some distros have this dir_index feature enabled by default; in others you'd have to enable it yourself. With it enabled, file lookups by name stay fast even with millions of files.

To see if the dir_index feature is activated, run (as root):

tune2fs -l /dev/sdaX | grep features

To activate the dir_index feature (as root):

tune2fs -O dir_index /dev/sdaX
e2fsck  -D /dev/sdaX

Replace /dev/sdaX with the partition for which you want to activate it.

vartec
There *is* a penalty; the difference is between an exponential, a linear, or a logarithmic penalty.
dsm
Do the numbers. You have 1 million files. With an index it takes you N seconds to access a file by name. Now divide it into 1,000 directories with 1,000 files each: it takes N/2 seconds to find the directory, and another N/2 to find the file within it. Total: N seconds, not counting the overhead of switching directories.
vartec
A: 

The obvious answer is that the folder will be extremely difficult for humans to use long before it hits any technical limit (the time taken to read the output of ls, for one; there are dozens of other reasons). Is there a good reason why you can't split it into subfolders?

Chris Huang-Leaver
+2  A: 

My experience with large directories on ext3 and dir_index enabled:

  • If you know the name of the file you want to access, there is almost no penalty
  • If you want to do operations that need to read in the whole directory entry (like a simple ls on that directory), it will take several minutes the first time. After that the directory stays in the kernel cache and there is no penalty anymore
  • If the number of files gets too high, you run into ARG_MAX et al problems. That basically means that wildcarding (*) does not always work as expected anymore. This is only if you really want to perform an operation on all the files at once

Without dir_index however, you are really screwed :-D

ypnos
A: 

Not every filesystem supports that many files.

On some of them (ext2, ext3, ext4) it's easy to hit the inode limit, since the number of inodes is fixed when the filesystem is created.
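You can check how much inode headroom a filesystem has before creating the files:

```shell
# Inode usage per mounted filesystem; storing 1,000,000 files
# needs 1,000,000 free inodes on the target filesystem.
df -i
```

If the IFree column for the target mount is below the file count, no amount of directory restructuring will help; the filesystem has to be recreated with more inodes.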

HMage