I am using the Perl stat() function to get the size of a directory and its subdirectories. I have a list of about 20 parent directories, which contain a few thousand subdirectories recursively, and every subdirectory holds a few hundred records. The main computing part of the script looks like this:

sub getDirSize {
    my $dirSize = 0;
    my @dirContent = <*>;

    my $sizeOfFilesInDir = 0;
    foreach my $dirContent (@dirContent) {
        if (-f $dirContent) {
            my $size = (stat($dirContent))[7];
            $dirSize += $size;
        } elsif (-d $dirContent) {
            $dirSize += getDirSize($dirContent);
        }
    }
    return $dirSize;
}

The script takes more than an hour to run, and I want to make it faster.

I also tried the shell du command, but the output of du (converted to bytes) is not accurate, and it is quite time consuming as well. I am working on HP-UX 11i v1.

+1  A: 

I see a couple of problems. One, @dirContent is explicitly set to <*>; this is re-evaluated each time you enter getDirSize, and the result will be an infinite loop, at least until you exhaust the stack (since it is a recursive call). Two, there is a special filehandle notation for reusing the information from a stat call: the underscore (_). See: http://perldoc.perl.org/functions/stat.html. Your code as-is calls stat three times for essentially the same information (-f, stat, and -d). Since file I/O is expensive, what you really want is to call stat once and then reference the data using "_". Here is some sample code that I believe accomplishes what you are trying to do:

#!/usr/bin/perl

my $size = 0;
getDirSize(".", \$size);

print "Size: $size\n";

sub getDirSize {
  my $dir  = shift;
  my $size = shift;

  opendir(my $dh, $dir) or return;
  foreach my $dirContent (grep(!/^\.\.?$/, readdir($dh))) {
     my $path = "$dir/$dirContent";
     stat($path);                # one stat; -f, -s and -d below reuse it via _
     if (-f _) {
       $$size += -s _;
     } elsif (-d _) {
       getDirSize($path, $size);
     }
  }
  closedir($dh);
}
Jamie
Thanks. Now it works about 10% faster. The my @dirContent = <*>; was a typo; I forgot to prefix the * with the current directory.
+2  A: 

I once faced a similar problem, and used a parallelization approach to speed it up. Since you have ~20 top-tier directories, this might be a pretty straightforward approach for you to try. Split your top-tier directories into several groups (how many groups is best is an empirical question), call fork() a few times and analyze directory sizes in the child processes. At the end of the child processes, write out your results to some temporary files. When all the children are done, read the results out of the files and process them.
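
A minimal sketch of that idea, assuming the ~20 parent directories live under /data (adjust the glob), four worker processes, and core File::Find to do the sizing inside each child; the paths and group count are placeholders, not part of the original setup:

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

my @topDirs = glob("/data/*");   # placeholder for your ~20 parent directories
my $workers = 4;                 # how many groups works best is empirical

# Round-robin the top-tier directories into groups, one group per child.
my @groups;
push @{ $groups[ $_ % $workers ] }, $topDirs[$_] for 0 .. $#topDirs;

my @tmpFiles;
for my $i (0 .. $#groups) {
    next unless $groups[$i] && @{ $groups[$i] };
    my $tmp = "/tmp/dirsize.$$.$i";
    push @tmpFiles, $tmp;

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {             # child: size its group, write the result, exit
        my $sum = 0;
        find(sub { $sum += -s _ if -f $_ }, @{ $groups[$i] });
        open(my $fh, '>', $tmp) or die "cannot write $tmp: $!";
        print {$fh} "$sum\n";
        close($fh);
        exit 0;
    }
}

wait() for @tmpFiles;            # wait for every child to finish

# Parent: collect and sum the partial results.
my $total = 0;
for my $tmp (@tmpFiles) {
    open(my $fh, '<', $tmp) or next;
    chomp(my $n = <$fh>);
    $total += $n;
    close($fh);
    unlink $tmp;
}
print "Total: $total\n";

Each child sizes a disjoint slice of the tree, so the parent only has to add up one number per child at the end.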

mobrule
Good suggestion. I will make few test runs to see what is the best way to distribute top-tier directories among forks. Thanks.
A: 

Whenever you want to speed something up, your first task is to find out what's slow. Use a profiler such as Devel::NYTProf to analyze the program and find out where you should concentrate your efforts.

In addition to reusing that data from the last stat, I'd get rid of the recursion since Perl is horrible at it. I'd construct a stack (or a queue) and work on that until there is nothing left to process.
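
Devel::NYTProf is typically run as perl -d:NYTProf script.pl followed by nytprofhtml to turn the output into a report. And here is a minimal sketch of the iterative, stack-based traversal (it keeps the single-stat/_ trick from the answer above; the starting directory is just a placeholder):

#!/usr/bin/perl
use strict;
use warnings;

# Iterative traversal: push directories onto a stack instead of recursing.
sub getDirSize {
    my @todo  = @_;              # start with the directories passed in
    my $total = 0;

    while (my $dir = pop @todo) {
        opendir(my $dh, $dir) or next;
        foreach my $entry (readdir($dh)) {
            next if $entry eq '.' || $entry eq '..';
            my $path = "$dir/$entry";
            stat($path);             # one stat; -f, -s and -d reuse it via _
            if (-f _) {
                $total += -s _;
            } elsif (-d _) {
                push @todo, $path;   # defer the subdirectory instead of recursing
            }
        }
        closedir($dh);
    }
    return $total;
}

print getDirSize("."), "\n";

Using shift instead of pop would turn the stack into a queue; either way, nothing recurses.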

brian d foy
A: 

If your main directory is overwhelmingly the largest consumer of directory and file inodes, then don't calculate it. Calculate the other half of the system and deduce the size of the rest from that (you can get the used disk space from df in a couple of milliseconds). You might need to add a small 'fudge' factor to get to the same numbers. (Also remember that if you measure space as root, you'll see more than other users do; ext2/ext3 on Linux reserves 5% for root. I don't know about HP-UX.)
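
A rough sketch of that complement approach, assuming bdf on HP-UX prints a BSD-style "Filesystem kbytes used avail %used Mounted on" line (verify the column layout on your box); the mount point and the already-scanned total are placeholders:

#!/usr/bin/perl
use strict;
use warnings;

my $mountPoint   = "/data";         # placeholder filesystem
my $scannedBytes = 123_456_789;     # placeholder: bytes counted in the part you did scan

# bdf prints a header line and then the data line(s); take the last line and
# grab the "used" column, counting fields from the right in case the
# filesystem name wrapped onto its own line.
my @bdf    = `bdf $mountPoint`;
my @fields = split ' ', $bdf[-1];
my $usedBytes = $fields[-4] * 1024;

my $restBytes = $usedBytes - $scannedBytes;   # approximate size of everything else
print "Used on $mountPoint: $usedBytes bytes; unscanned part: ~$restBytes bytes\n";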

Pasi Savolainen