I am looking for a tool or system that examines a database and identifies values that are out of the ordinary. I don't need real-time checks, just something that does its processing overnight or at scheduled times. I need it to work at two levels:

  1. Database-wide: e.g., compare the salaries of all employees and identify those that are far below or far above the average.

  2. Per employee: e.g., check an employee's salary history and identify payments that are out of the ordinary for that employee.

The two above are only examples; the same would apply to ATM withdrawals, shopping order history, invoice history, etc.

A: 

I don't have MySQL installed at the moment, but I guess the first can be achieved with a query similar to this (off the top of my head, not tested, might not work at all):

SELECT name, salary FROM emp WHERE salary>(SELECT AVG(salary) FROM emp);

Or, a more complex query would be:

SELECT name, salary FROM emp WHERE ABS(salary - (SELECT AVG(salary) FROM emp)) >
        (SELECT AVG(ABS(salary - (SELECT AVG(salary) FROM emp))) FROM emp);

The 2nd one selects the employees whose salaries differ from the average salary, in either direction, by more than the average absolute difference across all employees. The ABS() matters: without it, the differences from the average always average out to zero, and the query collapses into the first one.
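For anyone who wants to try the idea without a MySQL install, here is a minimal sketch that runs the same kind of comparison against an in-memory SQLite database from Python. The `emp` table comes from the queries above; the sample rows and the function name `find_salary_outliers` are made up for illustration, and SQLite is only standing in for MySQL here. It uses absolute differences so that both unusually low and unusually high salaries are flagged.

```python
import sqlite3


def find_salary_outliers(conn):
    """Return (name, salary) rows whose distance from the mean salary
    exceeds the mean absolute distance over all employees."""
    query = """
        SELECT name, salary
        FROM emp
        WHERE ABS(salary - (SELECT AVG(salary) FROM emp)) >
              (SELECT AVG(ABS(salary - (SELECT AVG(salary) FROM emp))) FROM emp)
        ORDER BY name
    """
    return conn.execute(query).fetchall()


# Hypothetical sample data, only here to exercise the query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?)",
    [
        ("alice", 50000),
        ("bob", 52000),
        ("carol", 48000),
        ("dave", 51000),
        ("erin", 150000),
        ("frank", 5000),
    ],
)

outliers = find_salary_outliers(conn)
print(outliers)
```

With this sample data only the two extreme salaries (erin and frank) are returned. The threshold is crude: extreme values pull the average itself around, so for real data something like a median-based cutoff may hold up better.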

Lemme know if it works.

Leo Jweda
Also, creating a view could be helpful if the query is recurring.
Agos
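To illustrate the view suggestion, here is a small sketch, again using an in-memory SQLite database from Python as a stand-in for MySQL. The view name `above_average_salaries` and the sample rows are hypothetical. The view stores the query, not its result, so each scheduled run just selects from it and sees the current data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?)",
    [("alice", 50000), ("bob", 52000), ("carol", 150000)],
)

# Define the recurring check once as a view (hypothetical name).
conn.execute(
    """
    CREATE VIEW above_average_salaries AS
    SELECT name, salary FROM emp
    WHERE salary > (SELECT AVG(salary) FROM emp)
    """
)

first_run = conn.execute(
    "SELECT name, salary FROM above_average_salaries ORDER BY name"
).fetchall()

# New data arrives; the view is re-evaluated on the next select.
conn.execute("INSERT INTO emp VALUES (?, ?)", ("dave", 400000))
second_run = conn.execute(
    "SELECT name, salary FROM above_average_salaries ORDER BY name"
).fetchall()

print(first_run, second_run)
```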
A: 

The hard part is defining "out of the ordinary."

What you're trying to do is essentially what fraud detection software does, for example when it tries to figure out whether somebody is laundering money. Your simple example is an easy one. The more complex ones are done with databases, statistics, data mining, and rules engines that contain lots of rules. It's not an easy problem, unless you want to restrict yourself to the trivial case that you cited.

If you manage to turn it into an easy problem, you'll be a wealthy person. Good luck.

duffymo
A: 

You could use Analysis Services and a data mining model.

Obviously you'd have to adapt the code, but here's a sample from Microsoft:

http://www.sqlserverdatamining.com/ssdm/Default.aspx?tabid=101&Id=83

"This sample shows how the clustering algorithm can be used to perform automatic data validation through the use of the PredictCaseLikelihood() function. To exercise the sample, enter values into the form and click the submit button. If the combination of values has a reasonable likelihood, the form will accept the values. If not, additional elements of the prediction query indicate the value likely to be unacceptable. Checking the “Show Details” box on the form will show the query that was sent in addition to the probability ratios used to determine the outlying values."

Shane Cusson
A: 

There are different methods for finding outliers: distance-based, cluster-based, etc.

You could use Data Applied's outlier detection or clustering analytics. The first one automatically finds the records which are most different from their N closest neighbors. The second finds large groups (clusters) of records, and identifies the records which don't fit well into any cluster. They make it free for small data sets, and it's online (http://www.data-applied.com). You don't have to write code, but you can use their Web API if you want.
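To give a feel for the distance-based approach, here is a minimal, generic sketch that scores each record by its average distance to its k nearest neighbors. This is not Data Applied's algorithm or API, just a plain illustration of the idea; the function name `knn_outlier_scores` and the sample ATM records are made up. It assumes numeric features and does no scaling, so in practice you would normalize the columns first.

```python
import math


def knn_outlier_scores(points, k):
    """Score each point by its mean Euclidean distance to its k nearest
    neighbors. Higher scores mean the point sits further from the rest."""
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        distances = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append(sum(distances[:k]) / k)
    return scores


# Hypothetical records: (withdrawal amount, hour of day).
records = [(100, 10), (110, 11), (105, 9), (95, 10), (2000, 3)]
scores = knn_outlier_scores(records, k=2)

# The record with the highest score is the most unusual one.
most_unusual = records[scores.index(max(scores))]
print(most_unusual)
```

This brute-force version compares every pair of records, which is fine for an overnight job on small tables but would need an index structure or sampling for large ones.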

Mark Senizer