I am the primary developer on a very sensitive system for my company. The code is designed fairly well, but a few flaws make it a little unstable. We are of course working to fix those flaws, but in the meantime things go wrong from time to time. The wrong thing going wrong could be very bad for the company, so in the interim it's imperative that we identify and fix problems very quickly. Longer term I would like an automated monitoring system that runs sanity checks on data and other things and notifies us of problems as they occur. Right now, though, I am seeking advice on how to make sure nothing catastrophic happens before we get to that point.

We have several checks (mostly data checks that can be done with a simple SQL query) that need to run every day, others that should run weekly, and others monthly. In the past I have given these queries to other people and made it their job to run them on schedule. Unfortunately, humans are imperfect and turnover is inevitable, so we always seem to discover something bad later than we would have liked because one or more of these manual checks were not run. Can someone offer advice, or point me to an application that might help me manage these scripts, or to an existing application that does some of this work for me? At this point my only option is a free application, but if someone suggests something that isn't free I will put it on the list of things to consider later. I know my company has an OpenNMS monitoring system in place, but the people in charge will not give me enough control to configure it for my system, and they don't respond to my requests to set up monitoring either. My company has also used Nagios in the past, but I don't think either of these does exactly what I want, as I'm not primarily looking for web monitoring.

I appreciate any help or advice.

+1  A: 

Hi,

You can try AlertGrid. With this app you can easily set up notification rules like "If my scheduled task hasn't finished in time -> send SMS".

We also use AlertGrid to monitor some logical stats of our scheduled tasks (we measure execution times and numbers of processed entries), and of course we get alerts when thresholds are crossed.

There is a free account available (not time limited) with a certain number of alerts included (the price depends mostly on the number of SMS and phone alerts). Integration with AlertGrid is very easy compared to other solutions.

(I am a member of the AlertGrid team)

Lukasz Dziedzia
Will AlertGrid allow me to easily execute some SQL against a database, analyze the results, and then send an alert if those results aren't in line with what we need?
omatase
AlertGrid allows you to trigger an external URL (it might be a URL pointing to your script, which will run the necessary logic). To be honest, I assumed from your description that you already have some scripts that run on a schedule and that you want to be alerted if execution fails to complete for some reason. Both cases are possible with AlertGrid. We can help you with this integration if you provide more details.
Lukasz Dziedzia
We don't have any scheduled scripts. We have some SQL scripts that we run manually. For instance, one might query a table and make sure there are no rows with a StatusID of 27. If it finds rows, the person running the script knows to raise a red flag. We would like that script to run automatically, though, and have something inform us of a problem automatically.
omatase
From what I understand, the easiest solution is to make your scripts automated and scheduled. After each script executes, you can send the interesting parameters to AlertGrid and define notification conditions around them to automate the alerting procedure. Of course, you can also set up alerts for situations where a scheduled task failed to complete, or didn't start at all, for some reason. I'm not sure what "raising a red flag" means in this case. If it means, e.g., creating a ticket in an external application, this might be possible thanks to the webhook actions available in AlertGrid.
Lukasz Dziedzia
Most likely it would be just an email.
omatase
How does this work? So if I detect a problem I can send a notification to alert-grid.com and it will send out notifications to all parties for which we have it configured?
omatase
This wasn't what I was looking for when I set out on this, but I really like AlertGrid. It's really cool. I've made a few test alerts with SMS and phone. Thanks for the info!
omatase
I see that PawelRoman explained this very clearly. It's good to hear you like AlertGrid :)
Lukasz Dziedzia
Yes I do. Does AlertGrid have any competitors that accept alerts in the same way? Most I have seen ping a web site and interpret the response.
omatase
+2  A: 

What you need is to write a very simple app that uses a timer to periodically trigger an action (e.g. running an SQL script and sending an email when the check fails, or whatever else you want). Then you install this app as a Windows service or Unix daemon, so it runs in the background all the time. Alternatively, you can trigger this app using Task Scheduler (Windows) or cron (Linux).
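As a rough illustration of the "very simple app" described above, here is a minimal Python sketch of one check run. It is only a sketch under assumptions: the `orders` table, the sender and recipient addresses, and the function names are all hypothetical, the StatusID 27 rule is borrowed from the example in the question's comments, and an in-memory SQLite database stands in for the real one. The alert email is built but not sent.

```python
import sqlite3
from email.message import EmailMessage

# Hypothetical check: table and column names are placeholders, not the real schema.
CHECK_SQL = "SELECT COUNT(*) FROM orders WHERE StatusID = 27"


def run_check(conn, sql=CHECK_SQL):
    """Return the number of offending rows found by the check query."""
    return conn.execute(sql).fetchone()[0]


def build_alert(check_name, bad_rows,
                sender="checks@example.com", recipient="oncall@example.com"):
    """Build (but do not send) an email describing a failed check."""
    msg = EmailMessage()
    msg["Subject"] = f"[ALERT] {check_name}: {bad_rows} bad row(s)"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(f"Check '{check_name}' found {bad_rows} row(s) that need attention.")
    return msg


def run_once(conn, check_name="status-27"):
    """Run the check once; return an alert message if it fails, else None."""
    bad_rows = run_check(conn)
    return build_alert(check_name, bad_rows) if bad_rows else None


def demo_connection():
    """Create an in-memory database with sample rows, two of which are bad."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, StatusID INTEGER)")
    conn.executemany("INSERT INTO orders (StatusID) VALUES (?)", [(1,), (27,), (27,)])
    return conn


if __name__ == "__main__":
    alert = run_once(demo_connection())
    print(alert["Subject"] if alert else "all clear")
```

To actually deliver the message you could hand it to `smtplib.SMTP(...).send_message(msg)` with your own mail server details, and let cron or Task Scheduler invoke the script daily, weekly or monthly rather than keeping a timer loop alive inside the process.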

Tools like AlertGrid can still be helpful, because even if you write such a small app and install it as a service/daemon, you'll never know if it suddenly failed and stopped for some reason (the worst case is the hosting machine going down). The problem is this: if you automate a recurring task, you eliminate the possibility of human error, but you start facing another enemy: "silent" failures.

So, to monitor whether your recurring tasks are really running, you need something that can receive "I'm alive" messages from your app and raise an alert when no message has been received in x minutes. This something must a) be OUTSIDE of the machine that hosts your app and b) be RELIABLE (so it won't go down itself).
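The receiving side of those "I'm alive" messages can be sketched in a few lines of Python. This is not how AlertGrid or any particular product is implemented. It is just a hypothetical `HeartbeatMonitor` class that remembers when each task last reported in and lists the tasks that have been silent for too long. In practice it would live on a different machine from the app and be fed by whatever transport you choose (for example an HTTP request).

```python
from datetime import datetime, timedelta


class HeartbeatMonitor:
    """Tracks the last "I'm alive" message per task (hypothetical helper)."""

    def __init__(self, max_silence):
        self.max_silence = max_silence  # timedelta: allowed gap between messages
        self.last_seen = {}             # task name -> datetime of last message

    def beat(self, task, now):
        """Record that `task` reported in at time `now`."""
        self.last_seen[task] = now

    def overdue(self, now):
        """Return the names of tasks that have been silent for too long."""
        return sorted(
            task for task, seen in self.last_seen.items()
            if now - seen > self.max_silence
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(max_silence=timedelta(minutes=10))
    start = datetime(2024, 1, 1, 12, 0)
    monitor.beat("daily-checks", start)
    # Fifteen minutes later nothing else has arrived, so the task is flagged.
    print(monitor.overdue(start + timedelta(minutes=15)))
```

Note that the monitor only flags tasks it has heard from at least once. A task that never starts at all would need to be registered up front, which this sketch leaves out.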

Tools like AlertGrid do exactly this, and more. AlertGrid is nice, because it is relatively easy to integrate and use.

But that's not all. You wrote: "So if I detect a problem I can send a notification to alert-grid.com and it will send out notifications to all parties for which we have it configured?". The catch is that you say YOU want to detect the problem. Consider the other approach: configure AlertGrid to decide whether an event is an incident or not. Not all events have to be incidents. Most of the time your SQL scripts will pass without errors, right? Why not report success as well? This way you kill two birds with one stone: you monitor whether your app is running by checking periodically for any events (both successful and failed) and raising an alert if you didn't get an event in x amount of time, and you automatically detect which events are incidents and send notifications by email, SMS or phone to the appropriate contact people. Another advantage: if the notification rules change (e.g. you want to send an SMS to Mr X instead of an email to Mrs Y), you don't need to recompile or redeploy your app. You only reconfigure the rules in AlertGrid.
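To make the "report everything, let the receiver decide" idea concrete, here is a small Python sketch. It does not use AlertGrid's real API. The `Rule` class, the event dictionary layout and the contact details are all made up for illustration. The app always sends the same kind of event, and only the rules on the receiving side determine whether it counts as an incident and who gets notified.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """A receiver-side notification rule (hypothetical structure)."""
    check: str         # name of the check this rule applies to
    max_bad_rows: int  # an event is an incident above this threshold
    channel: str       # e.g. "email", "sms", "phone"
    contact: str       # who to notify


def notifications_for(event, rules):
    """Return (channel, contact) pairs for every rule the event violates."""
    return [
        (rule.channel, rule.contact)
        for rule in rules
        if rule.check == event["check"] and event["bad_rows"] > rule.max_bad_rows
    ]


if __name__ == "__main__":
    rules = [Rule("status-27", 0, "email", "mrs.y@example.com")]

    # The app reports every run, successful or not, in the same format.
    print(notifications_for({"check": "status-27", "bad_rows": 0}, rules))
    print(notifications_for({"check": "status-27", "bad_rows": 3}, rules))

    # Changing who is notified touches only the rules, not the reporting app.
    rules = [Rule("status-27", 0, "sms", "+15550100")]
    print(notifications_for({"check": "status-27", "bad_rows": 3}, rules))
```

Because successful runs are reported too, every event doubles as an "I'm alive" message for the silence check.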

PawelRoman
Very helpful. This is exactly the kind of information I was looking for. I had hoped to find something free and easy to implement that could execute my tasks for me, but I knew I would ultimately move to a more customizable Windows service or something like that. I hadn't considered the silent failure problem, though, so I really appreciate your help and your suggestion on how to integrate this with solutions such as AlertGrid.
omatase