We are having a problem on Linux with directory inodes getting large and slow to navigate over time, as many files are created and removed. For example:

% ls -ld foo
drwxr-xr-x    2 webuser  webuser   1562624 Oct 26 18:25 foo
% time find foo -type f | wc -l
    518
real    0m1.777s
user    0m0.000s
sys     0m0.010s

% cp -R foo foo.tmp
% ls -ld foo.tmp                                                                       
drwxr-xr-x    2 webuser  webuser     45056 Oct 26 18:25 foo.tmp   
% time find foo.tmp -type f | wc -l
    518
real    0m0.198s
user    0m0.000s
sys     0m0.010s

The original directory has 518 files, takes 1.5 MB to represent, and takes 1.7 seconds to traverse.

The rebuilt directory has the same number of files, takes 45 KB to represent, and takes 0.2 seconds to traverse.

I'm wondering what would cause this. My guess is fragmentation - this is not supposed to be a problem with Unix file systems in general, but in this case we are using the directory for short-term cache files and are thus constantly creating, renaming and removing a large number of small files.

I'm also wondering if there's a way to dump the literal binary contents of the directory - that is, read the directory as if it were a file - which would perhaps give me insight into why it is so big. Neither read() nor sysread() from Perl will let me do this:

 swartz> perl -Mautodie -MPOSIX -e 'sysopen(my $fh, "foo", O_RDONLY); my $len = sysread($fh, $buf, 1024);'
 Can't sysread($fh, '', '1024'): Is a directory at -e line 1

System info:

Linux 2.6.18-128.el5PAE #1 SMP Wed Dec 17 12:02:33 EST 2008 i686 i686 i386 GNU/Linux

Thanks!

Jon

+2  A: 

For question 1, external fragmentation normally causes an overhead of about 2x or so [1], plus you have internal fragmentation from allocation granularity. Neither of these comes close to explaining your observation.

So, I don't think it is normal steady-state fragmentation.

The most obvious speculation is that 1.5 MB is the high-water mark: at one time the directory really did hold either 1.5 MB worth of entries, or about half that with the expected fragmentation making up the rest.
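
If so, it should be easy to reproduce: on ext2/ext3 the directory inode grows to hold the peak number of entries and is not shrunk when the entries are removed. A rough shell sketch (the path and file count are made up for illustration):

 # Grow a directory with many entries, then remove them all
 mkdir /tmp/hwm-test
 cd /tmp/hwm-test
 seq 1 100000 | xargs touch
 ls -ld .              # the directory inode is now large
 ls | xargs rm
 ls -ld .              # on ext3 the size typically stays at its peak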

Another speculation is that the 50% rule is being defeated by a non-Markovian allocation pattern. Imagine naming files "tmp%d": tmp1, tmp2, ..., tmp1000, tmp1001, ...

The problem here is that rm tmp1 doesn't make room for tmp1001. This is obviously a wild guess.
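
One way to check that guess (a sketch only; the path, window size and iteration count are arbitrary) is to keep the number of files constant while giving them ever-increasing names, and watch whether the directory inode keeps growing anyway:

 # Maintain a rolling window of 500 files with monotonically increasing names
 mkdir /tmp/churn-test
 i=0
 while [ $i -lt 50000 ]; do
     i=$((i + 1))
     touch /tmp/churn-test/tmp$i
     # remove the file created 500 iterations ago
     [ $i -gt 500 ] && rm /tmp/churn-test/tmp$((i - 500))
 done
 ls -ld /tmp/churn-test   # compare with a fresh cp -R copy of the same 500 files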

Q2: There isn't a good way to read a raw directory from user space. AFAIK, you would need to either hack the kernel, or use debugfs to change the inode type, read it, then change it back, or use debugfs to read the inode, get the block numbers, and then read the blocks directly. The debugfs approach is probably the more reasonable one.
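
For example, something along these lines with debugfs (run against the block device that holds the directory; /dev/sda1, the path and the block number below are placeholders, and exact command names may vary with your e2fsprogs version):

 # Show the directory's inode, including the data block numbers it uses
 debugfs -R 'stat /path/to/foo' /dev/sda1

 # Hex-dump one of those blocks (block_dump is also abbreviated bd)
 debugfs -R 'block_dump 123456' /dev/sda1

 # For an indexed (htree) directory, dump the hash tree
 debugfs -R 'htree /path/to/foo' /dev/sda1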

You can address the performance issue by making sure that directory indexing (dir_index) is enabled. See tune2fs.
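
Roughly like this (the device name is a placeholder; e2fsck -D should be run with the filesystem unmounted):

 # Check whether dir_index is already in the feature list
 tune2fs -l /dev/sda1 | grep -i features

 # Enable hashed-tree directory indexing (affects newly created directories)
 tune2fs -O dir_index /dev/sda1

 # Optimize/rebuild existing directories
 e2fsck -fD /dev/sda1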


[1] Knuth's fifty percent rule: in the steady state, 50% of ops are allocations and 50% are frees, 50% of free blocks merge, so holes are 50% of allocations and 50% of the space is wasted (a.k.a. 100% overhead). This is considered "normal". Malloc has the same problem.

DigitalRoss
Good answer, thanks. We had a large temporary increase of # files at one point, so your "obvious speculation" has some merit (and unfortunately it was not obvious to me until you pointed it out. :))
Jonathan Swartz