I am using Perl readdir to get a file listing, but the directory contains more than 250,000 files, so readdir takes a long time (more than 4 minutes) and uses over 80 MB of RAM. As this is intended to be a recurring job running every 5 minutes, that lag is not acceptable.

More info: Another job fills the scanned directory once per day. This Perl script is responsible for processing the files. A file count is specified for each script iteration, currently 1000 per run. The Perl script is to run every 5 minutes and process (if applicable) up to 1000 files. The file count limit is intended to let downstream processing keep up, since Perl pushes data into a database, which triggers a complex workflow.

Is there another way to obtain filenames from the directory, ideally limited to 1000 (set by a variable), that would greatly increase the speed of this script?

A: 

Probably not. I would guess most of the time is spent reading the directory entries.

However, you could preprocess the entire directory listing, creating one listing file per 1000 entries. Then your process could work through one of those listing files each time and not incur the expense of reading the entire directory.
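
For illustration, here is a minimal sketch of that preprocessing step (the paths and chunk size are assumptions; adjust them to your setup). It reads the directory once and writes one listing file per 1000 names:

use strict;
use warnings;

my $dir        = '/path/to/incoming';   # assumed location of the big directory
my $list_dir   = '/path/to/listings';   # assumed location for the chunk files
my $chunk_size = 1000;

opendir my $dh, $dir or die "Cannot open $dir: $!";

my ( @chunk, $n );
while ( defined( my $file = readdir $dh ) ) {
    next if $file eq '.' or $file eq '..';
    push @chunk, $file;
    if ( @chunk == $chunk_size ) {
        write_chunk( ++$n, \@chunk );
        @chunk = ();
    }
}
write_chunk( ++$n, \@chunk ) if @chunk;
closedir $dh;

sub write_chunk {
    my ( $num, $files ) = @_;
    open my $out, '>', "$list_dir/chunk.$num"
        or die "Cannot write chunk $num: $!";
    print {$out} "$_\n" for @$files;
    close $out;
}

Your 5-minute job would then pick up the oldest chunk file, process the names in it, and delete it.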

Have you tried just readdir() through the directory without any other processing at all to get a baseline?

Jason Cohen
Yes, the data I provided (>4min) is just the readdir operation. I've set the process count to 1 for the test.
Walinmichi
+7  A: 

The solution may lie at the other end: in the script that fills the directory...

Why not create a directory tree to store all those files, so that you have lots of directories, each with a manageable number of files?

Instead of creating "mynicefile.txt", why not "m/my/mynicefile", or something like that?

Your file system would thank you for that (especially if you remove the empty directories when you have finished with them).
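
For what it's worth, here is a rough sketch of that naming scheme (the OP notes below that the filling job is out of his control, so this is only illustrative; bucketed_path() and the paths are made up):

use strict;
use warnings;
use File::Path qw(make_path);
use File::Copy qw(move);

# Bucket a file by the first one and two characters of its name,
# e.g. "mynicefile.txt" -> "$root/m/my/mynicefile.txt".
sub bucketed_path {
    my ( $root, $name ) = @_;
    my $dir = join '/', $root,
        lc substr( $name, 0, 1 ),
        lc substr( $name, 0, 2 );
    make_path($dir) unless -d $dir;
    return "$dir/$name";
}

# The filling script would move each new file into its bucket:
# move( 'incoming/mynicefile.txt',
#       bucketed_path( 'incoming', 'mynicefile.txt' ) );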

siukurnin
+1, I generally try to keep a folder under 1000 files; any more and file-system stat() calls just start to choke.
Kent Fredric
"Doctor, doctor! It hurts when I touch my wrist." "Well, the solution is simple. Stop doing that!"
Brian Campbell
So, mister voodoo doctor: tell us about your magic solution. I am interested too (but don't want to sacrifice any animals in the process).
siukurnin
This solution is not appropriate, as another process outside of my scope/control feeds the directory. Current analysis suggests that the solution from daotoad is effective and allows control over the number of files listed at one time.
Walinmichi
+1  A: 

This is not exactly an answer to your query, but I think having that many files in the same directory is not a very good thing for overall speed (including the speed at which your filesystem handles add and delete operations, not just the listing, as you have seen).

A solution to that design problem is to have sub-directories for each possible first letter of the file names, and have all files beginning with that letter inside that directory. Recurse to the second, third, etc. letter if need be.

You will probably see a definite speed improvement on many operations.

Varkhan
I do not have control over the file-filling part; it is just an FTP pull of zip files that are then uncompressed. I'm thinking of creating another script that will run once per hour or so to create a single file of file names, to be used by the more frequent posting script.
Walinmichi
+2  A: 

You're saying that the content gets there by unpacking zip file(s). Why don't you just work on the zip files instead of creating/using 250k files in one directory?

Basically, to speed it up, you don't need anything specific in Perl, but rather something at the filesystem level. If you are 100% sure that you have to work with 250k files in one directory (and I can't imagine a situation where something like this would be required), you're much better off finding a better filesystem to handle it than finding some "magical" module in Perl that would scan it faster.

depesz
I don't understand how to work with the compressed zip file(s). BTW, there are many situations where I operate on very large sets of files... normally that's not an issue, as I may know the file handle or obtain it from another process. In this case the files are "dumped" on me by another process outside my control.
Walinmichi
@unknown - You can use Archive::Zip to work with zip files.
Michael Carman
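
For reference, a rough sketch of reading the file list straight from the archive with Archive::Zip (the archive path is an assumption), which avoids unpacking 250k files at all:

use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES);

my $zip = Archive::Zip->new;
$zip->read('/path/to/archive.zip') == AZ_OK
    or die "Cannot read zip archive";

# List the member names without unpacking anything to disk.
print "$_\n" for $zip->memberNames;

# A single member can be extracted on demand:
# $zip->extractMember( $name, "/path/to/workdir/$name" );
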
+8  A: 

What exactly do you mean when you say readdir is taking minutes and 80 MB? Can you show that specific line of code? Are you using readdir in scalar or list context?

Are you doing something like this:

foreach my $file ( readdir($dir) ) { 
   #do stuff here
}

If that's the case, you are reading the entire directory listing into memory. No wonder it takes a long time and a lot of memory.

The rest of this post assumes that this is the problem; if you are not using readdir in list context, ignore the rest of the post.

The fix for this is to use a while loop and use readdir in a scalar context.

while ( defined( my $file = readdir $dir ) ) {
    # do stuff
}

Now you only read one item at a time. You can add a counter to keep track of how many files you process, too.
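
Putting that together with the 1000-file limit from the question, a minimal sketch might look like this (the path is a placeholder, and process_file() stands in for your real processing):

use strict;
use warnings;

my $dir   = '/path/to/incoming';   # placeholder
my $limit = 1000;

opendir my $dh, $dir or die "Cannot open $dir: $!";

my $count = 0;
while ( defined( my $file = readdir $dh ) ) {
    next if $file eq '.' or $file eq '..';

    process_file("$dir/$file");
    last if ++$count >= $limit;    # stop after 1000 files this run
}
closedir $dh;

sub process_file {
    my ($path) = @_;
    # placeholder: push the file into the database here
}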

daotoad
Brilliant. I may need to go back and refactor some directory access!
magnifico
The defined check is implicit; while (my $file = readdir $dir) { } is OK.
Benoît
This solved the issue for me. It also allowed tight control over how many file names were retrieved, so I could stop at the desired threshold. Thanks, daotoad.
Walinmichi
A: 

You aren't going to be able to speed up readdir, but you can speed up the task of monitoring a directory. You can ask the OS for updates -- Linux has inotify, for example. Here's an article about using it:

http://www.ibm.com/developerworks/linux/library/l-ubuntu-inotify/index.html?ca=drs-

You can use Inotify from Perl:

http://search.cpan.org/~mlehmann/Linux-Inotify2-1.2/Inotify2.pm

The difference is that you will have one long-running app instead of a script that is started by cron. In the app, you'll keep a queue of files that are new (as provided by inotify). Then, you set a timer to go off every 5 minutes, and process 1000 items. After that, control returns to the event loop, and you either wake up in 5 minutes and process 1000 more items, or inotify sends you some more files to add to the queue.

(BTW, you will need an event loop to handle the timers; I recommend EV.)
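
To make that concrete, here is a rough sketch of the Linux::Inotify2 + EV combination (the directory, the 5-minute/1000-file numbers, and process_file() are placeholders):

use strict;
use warnings;
use EV;
use Linux::Inotify2;

my $dir   = '/path/to/incoming';   # placeholder
my $limit = 1000;
my @queue;

my $inotify = Linux::Inotify2->new
    or die "Unable to create inotify object: $!";

# Queue every file that shows up in the directory.
$inotify->watch( $dir, IN_CLOSE_WRITE | IN_MOVED_TO, sub {
    my $event = shift;
    push @queue, $event->fullname;
} );

# Feed inotify events into the EV loop.
my $io = EV::io $inotify->fileno, EV::READ, sub { $inotify->poll };

# Every 5 minutes, drain up to $limit files from the queue.
my $timer = EV::timer 300, 300, sub {
    process_file($_) for splice @queue, 0, $limit;
};

sub process_file {
    my ($path) = @_;
    # placeholder: push the file into the database here
}

EV::run;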

jrockway