If there are like 1,000,000 individual files (mostly 100k in size) in a single directory, flatly (no other directories and files in them), is there going to be any compromises in efficiency or disadvantages in any other possible ways?
views:
998answers:
6ARG_MAX is going to take issue with that... for instance, rm -rf * (while in the directory) is going to say "too many arguments". Utilities that want to do some kind of globbing (or a shell) will have some functionality break.
If that directory is available to the public (lets say via ftp, or web server) you may encounter additional problems.
The effect on any given file system depends entirely on that file system. How frequently are these files accessed, what is the file system? Remember, Linux (by default) prefers keeping recently accessed files in memory while putting processes into swap, depending on your settings. Is this directory served via http? Is Google going to see and crawl it? If so, you might need to adjust VFS cache pressure and swappiness.
Edit:
ARG_MAX is a system wide limit to how many arguments can be presented to a program's entry point. So, lets take 'rm', and the example "rm -rf *" - the shell is going to turn '*' into a space delimited list of files which in turn becomes the arguments to 'rm'.
The same thing is going to happen with ls, and several other tools. For instance, ls foo* might break if too many files start with 'foo'.
I'd advise (no matter what fs is in use) to break it up into smaller directory chunks, just for that reason alone.
When you accidently execute "ls" in that directory, or use tab completion, or want to execute "rm *", you'll be in big trouble. In addition, there may be performance issues depending on your file system.
It's considered good practice to group your files into directories which are named by the first 2 or 3 characters of the filenames, e.g.
aaa/ aaavnj78t93ufjw4390 aaavoj78trewrwrwrwenjk983 aaaz84390842092njk423 ... abc/ abckhr89032423 abcnjjkth29085242nw ... ...
Most distros use Ext3 by default, which can use b-tree indexing for large directories.
Some of distros have this dir_index
feature enabled by default in others you'd have to enable it yourself. If you enable it, there's no slowdown even for millions of files.
To see if dir_index
feature is activated do (as root):
tune2fs -l /dev/sdaX | grep features
To activate dir_index feature (as root):
tune2fs -O dir_index /dev/sdaX
e2fsck -D /dev/sdaX
Replace /dev/sdaX
with partition for which you want to activate it.
The obvious answer is the folder will be extremely difficult for humans to use long before any technical limit, (time taken to read the output from ls for one, their are dozens of other reasons) Is there a good reason why you can't split into sub folders?
My experience with large directories on ext3 and dir_index
enabled:
- If you know the name of the file you want to access, there is almost no penalty
- If you want to do operations that need to read in the whole directory entry (like a simple
ls
on that directory) it will take several minutes for the first time. Then the directory will stay in the kernel cache and there will be no penalty anymore - If the number of files gets too high, you run into ARG_MAX et al problems. That basically means that wildcarding (
*
) does not always work as expected anymore. This is only if you really want to perform an operation on all the files at once
Without dir_index
however, you are really screwed :-D
Not every filesystem supports that many files.
On some of them (ext2, ext3, ext4) it's very easy to hit inode limit.