views:

220

answers:

4

Background

Here is the problem:

  1. A black box outputs a new number each day.
  2. Those numbers have been recorded for a period of time.
  3. Detect when a new number from the black box falls outside the pattern of numbers established over the time period.

The numbers are integers, and the time period is a year.

Question

What algorithm will identify a pattern in the numbers?

The pattern might be simple, like always ascending or always descending, or the numbers might fall within a narrow range, and so forth.

Ideas

I have some ideas, but am uncertain as to the best approach, or what solutions already exist:

  • Machine learning algorithms?
  • Neural network?
  • Classify normal and abnormal numbers?
  • Statistical analysis?
+3  A: 

Could you use statistical analysis?

The R programming language has some incredibly robust mathematical models available, some of them with the ability to forecast the trend.

Run a generalized additive model on the preceding values, then run a prediction for the next likely value. If the new value is outside the range for the predicted value, then you can email the anomaly.

See also:

A project that visually depicts a curve calculated using GAM (and other models):

The code that guesses the next expected number will be similar to the following:

gam.object <- gam( y ~ s(x) )
predict(gam.object)[1]

Where:

  • x = Dates that the integers were provided.
  • y = Set of integers (per date)

The most difficult part will be installing R. ;-)

Dave Jarvis
I think all machine learning methods are based on statistical analysis, so yes I can use it :) For the answer to be authoritative, could you explain why a generalized additive model is a good choice?
Joseph Garvin
@Joseph: When I think of machine learning, I think database entanglement. When I think of stats, I think of pure mathematical analysis that might include a database, but is not necessary. (All machine learning requires statistics, but not all statistics requires machine learning.)
Dave Jarvis
You're making a lot of assumptions about the black box here. Imagine it produces (a) prime numbers, (b) ASCII-codes for phone book entries or (c) Gödel numbers for undecidable sentences. Do you think any statistical model would find outliers in these number series?
nikie
@nikie: You are correct. I am presuming that the black box is not producing number sequences that cannot be detected through statistical analysis. The question, as phrased, however, implies that it is possible to detect the abnormalities produced by the black box. I would posit that the question would have specifically stated that the sequences produced from the black box are difficult to analyze and predict. The simplicity of the examples given, though, indicate otherwise.
Dave Jarvis
@Joseph: There is a substantial amount of documentation on the Generalized Additive Model and a number of other superb statistical models (e.g., LOWESS, ARMA). I would encourage you to research them and decide for yourself what model would best detect anomalous data.
Dave Jarvis
+1  A: 

You could try line fitting prediction using linear regression and see how it goes, it would be fairly easy to implement in your language of choice. After you fitted a line to your data, you could calculate the mean standard deviation along the line. If the novel point is on the trend line +- the standard deviation, it should not be regarded as an abnormality.

PCA is an other technique that comes to mind, when dealing with this type of data.

You could also look in to unsuperviced learning. This is a machine learning technique that can be used to detect differences in larger data sets.

Sounds like a fun problem! Good luck

Theodor
+2  A: 

There is little magic in all the techniques you mention. I believe you should first try to narrow the typical abnormalities you may encounter, it helps keeping things simple.

Then, you may want to compute derived quantities relevant to those features. For instance: "I want to detect numbers changing abruptly direction" => compute u_{n+1} - u_n, and expect it to have constant sign, or fall in some range. You may want to keep this flexible, and allow your code design to be extensible (Strategy pattern may be worth looking at if you do OOP)

Then, when you have some derived quantities of interest, you do statistical analysis on them. For instance, for a derived quantity A, you assume it should have some distribution P(a, b) (uniform([a, b]), or Beta(a, b), possibly more complex), you put a priori laws on a, b and you ajust them based on successive information. Then, the posterior likelihood of the info provided by the last point added should give you some insight about it being normal or not. Relative entropy between posterior and prior law at each step is a good thing to monitor too. Consult a book on Bayesian methods for more info.

I see little point in complex traditional machine learning stuff (perceptron layers or SVM to cite only them) if you want to detect outliers. These methods work great when classifying data which is known to be reasonably clean.

Alexandre C.
+1  A: 

Cluster your data.

If you don't know how many modes your data will have, use something like a Gaussian Mixture Model (GMM) along with a scoring function (e.g., Bayesian Information Criterion (BIC)) so you can automatically detect the likely number of clusters in your data. I recommend this instead of k-means if you have no idea what value k is likely to be. Once you've constructed a GMM for you data for the past year, given a new datapoint x, you can calculate the probability that it was generated by any one of the clusters (modeled by a Gaussian in the GMM). If your new data point has low probability of being generated by any one of your clusters, it is very likely a true outlier.

If this sounds a little too involved, you will be happy to know that the entire GMM + BIC procedure for automatic cluster identification has been implemented for you in the excellent MCLUST package for R. I have used it several times to great success for such problems.

Not only will it allow you to identify outliers, you will have the ability to put a p-value on a point being an outlier if you need this capability (or want it) at some point.

awesomo