We have a UFS partition on Solaris.

The volume becomes full. We keep trying to write to it, and naturally open() returns -1 immediately.

When a cron job that does a mass delete fires up, open() no longer returns in a timely manner: it takes at least six seconds, because that is how long the watchdog waits before killing the process.

Now, the obvious thought is that the deletes are keeping the file system busy and open() just takes forever... but is there any concrete knowledge out there about this behaviour?

A: 

Perhaps the program doing the mass delete could be changed to behave more gracefully on a filesystem that is having problems. If it runs queries to find the files to delete, it may not be the open() call that is timing out. To test the theory, can you set up a cron job that simply removes a single file with a known name while the disk is full? How does the mass-delete program decide which open() call to make?

It's also possible to control the percentage of disk utilization at which writes stop working (on UFS this is the minfree reservation, tunable with tunefs -m); you could try setting it to a lower percentage. If you are detecting the disk-full state by waiting for a file-creation call to return -1, consider adding an explicit check to your code, so that when the filesystem passes a certain percentage full you take corrective action before writes start failing.

Chris Quenelle
A: 

A mass delete causes a storm of random IO, which really hurts performance, and it generates a large number of journal/log transactions to commit (try the nologging mount option?). Moreover, if your filesystem is nearly full, open() will in any case take some time to find space for a new inode.

Deleting files more often, fewer at a time, may give you lower response times. Or simply delete them more slowly, sleeping between each rm.

Benoît