We're using Quartz.Net to schedule about two hundred repeating jobs. Each job uses the same IJob implementing class, but they can have different schedules. In practice, they end up having the same schedule, so we have about two hundred job details, each with their own (identical) repeating/simple trigger, scheduled. The interval is one hour.

The task this job performs is to download an rss feed, and then download all of the media files linked to in the rss feed. Prior to downloading, it wipes the directory where it is going to place the files. A single run of a job takes anywhere from a couple seconds to a dozen seconds (occasionally more).

Our method of scheduling is to call GetScheduler() on a new StdSchedulerFactory (all jobs are scheduled at once into the same IScheduler instance). We follow the scheduling with an immediate Start().

The jobs appear to run fine, but upon closer inspection we are seeing that a minority of the jobs run only occasionally, or almost never.

So, for example, all two hundred jobs should have run at 6:40 pm this evening. Most of them did. But a handful did not. I determine this by looking at the file timestamps, which should certainly be updated if the job runs (because it deletes and redownloads the file).

I've enabled Quartz.Net logging, and added quite a few logging statements to our code as well.

I get log messages that indicate Quartz is creating and executing jobs for roughly one minute after the round of jobs starts.

After that, all activity stops. No jobs run, no log messages are created. Zero.

And then, at the next firing interval, Quartz starts up again and my log files update, and various files start downloading. But - it certainly appears like some JobDetail instances never make it to the head of the line (so to speak) or do so very infrequently. Over the entire weekend, some jobs appeared to update quite frequently, and recently, and others had not updated a single time since starting the process on Friday (it runs in a Windows Service shell, btw).

So ... I'm hoping someone can help me understand this behavior of Quartz.

I need to be certain that every job runs. If its trigger is missed, I need Quartz to run it as soon as possible. From reading the documentation, I thought this would be the default behavior: for a SimpleTrigger with an indefinite repeat count, it would reschedule the job for immediate execution if the trigger window was missed. This doesn't seem to be the case. Is there any way I can determine why Quartz is not firing these jobs? I am logging at the trace level and the entries simply aren't there. It creates and executes an awful lot of jobs, but when I notice one missing, all I can find is when it last ran (sometimes it hasn't run for hours or days). There is nothing about why it was skipped (I expected Quartz to log something if it skips a job for any reason).

Any help would really, really be appreciated - I've spent my entire day trying to figure this out.

+1  A: 

After reading your post, it sounds a lot like the handful of jobs that are not executing are very likely misfiring. The reason that I believe this:

"I get log messages that indicate Quartz is creating and executing jobs for roughly one minute after the round of jobs starts."

In Quartz.NET the default misfire threshold is 1 minute. Chances are, you need to examine your logging configuration to determine why those misfire events are not being logged. I'd suggest throwing open the floodgates on your logging (i.e. set everything to debug, and make sure that you definitely have a logging directive for the Quartz scheduler class), and then rerunning your jobs. I'm almost positive the misfire events are not showing up in your logs because the logging configuration is lacking something. This is understandable, because logging configuration can get very confusing, very quickly.
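To make the one-minute cutoff concrete, here is a rough back-of-the-envelope model in Python. It is not Quartz code, just a sketch of the arithmetic. The function name `simulate_round`, the 10-worker pool (I believe that is the default pool size, but treat it as an assumption), and the fixed 3.5-second job duration are all made up for illustration. Only the 200 jobs and the 60-second threshold come from the discussion above.

```python
import heapq

def simulate_round(num_jobs, workers, job_seconds, misfire_threshold):
    """Model one firing round in which every job is due at t=0.

    Returns (ran, misfired) as lists of job indices. A job counts as
    misfired when no worker can pick it up until after the threshold.
    """
    # Each heap entry is the time at which one worker becomes free.
    free_at = [0.0] * workers
    heapq.heapify(free_at)
    ran, misfired = [], []
    for job in range(num_jobs):
        start = heapq.heappop(free_at)  # earliest moment a worker is available
        if start > misfire_threshold:
            # The trigger is already older than the threshold: skip it.
            misfired.append(job)
            heapq.heappush(free_at, start)  # the worker stays idle
        else:
            ran.append(job)
            heapq.heappush(free_at, start + job_seconds)
    return ran, misfired

# Hypothetical numbers: 10 workers, every job takes 3.5 seconds.
ran, misfired = simulate_round(num_jobs=200, workers=10,
                               job_seconds=3.5, misfire_threshold=60.0)
print(len(ran), "ran;", len(misfired), "missed the one-minute window")
```

Under these assumptions the last batch of jobs cannot start until after the 60-second mark, which would look exactly like about a minute of activity followed by silence. Doubling `workers` to 20 in the same model lets every job start in time.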

Also, in the future, you might want to consult the Quartz.NET group on Google Groups, since that is where some of the thornier issues are discussed.

http://groups.google.com/group/quartznet?pli=1

As for your other question, about setting the policy for what the scheduler should do on a misfire, I can't specifically help you there. If you read the API docs closely, and also consult the Google discussion group, you should be able to set the misfire policy flag that suits your needs. I believe that triggers have a MisfireInstruction property which you can configure.

Also, I would argue that misfires introduce a lot of "noise" and should be avoided; perhaps bumping up the thread count on your scheduler would be a way to avoid misfires? The other option would be to stagger your job execution into separate/multiple batches.
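Both options can be sized with simple arithmetic. The sketch below is plain Python, not Quartz code, and the helper names `min_workers` and `stagger_offsets` are hypothetical. It assumes every job is due at the same instant and takes the same amount of time, which is a simplification of the real workload.

```python
import math

def min_workers(num_jobs, job_seconds, misfire_threshold):
    """Smallest pool size that lets every job start within the threshold,
    assuming all jobs are due at once and each takes job_seconds."""
    # Number of back-to-back waves one worker can start within the threshold.
    waves = math.floor(misfire_threshold / job_seconds) + 1
    return math.ceil(num_jobs / waves)

def stagger_offsets(num_jobs, batch_size, batch_gap_seconds):
    """Start offset in seconds for each job when jobs are released in batches."""
    return [(job // batch_size) * batch_gap_seconds for job in range(num_jobs)]

# Worst case from the question: 200 jobs at about a dozen seconds each.
print(min_workers(200, 12, 60))

# Alternative: keep a small pool and release 10 jobs every 15 seconds.
offsets = stagger_offsets(200, batch_size=10, batch_gap_seconds=15)
print(offsets[0], offsets[-1])
```

If I remember correctly, the pool size and the threshold are set through the `quartz.threadPool.threadCount` and `quartz.jobStore.misfireThreshold` properties, but check that against the documentation for your version. With staggering, the batch gap should be at least as long as the slowest job you expect, so that one batch finishes before the next is released.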

Good luck!

warriorpostman
I was thinking similarly, 1 min makes it seem like a misfire. The logging configuration is very straightforward, and all levels should be getting output. I also did cross-post the question in the google group prior to asking here. Besides the thread count, I think I can raise the misfire window to a longer period. The API docs were clear that the default MisfireInst. for SimpleTrigger is RescheduleNowWithExistingRepeatCount, which seems like what I want, but doesn't do what it seems to say it will, and the docs say FireNow is equiv. to the default if the repeat count is indefinite.
qstarin
I was able to finally trap the misfires by attaching a global TriggerListener. It did appear that the one minute misfire threshold was passing and further jobs were being skipped. The misfire behavior observed doesn't seem to match the documentation, but nevertheless, increasing the misfire threshold helps avoid the problem. It still leaves a hole for jobs to fall through, though.
qstarin
Glad you were able to uncover the issue. I wonder if we had hooked up a global TriggerListener on my project. For some reason, I thought misfires were automatically logged by the scheduler. Good luck with configuring misfire policies. That documentation is way lacking, and unfortunately, sometimes the only solution is a lot of trial and error.
warriorpostman