ansaurus

Question

Recommended anomaly detection technique for simple, one-dimensional scenario?

Answer 1

+5 A:

A simple approach is the three-sigma rule:

mu  = mean of the data
std = standard deviation of the data
if abs(x-mu) > 3*std  then  x is an outlier
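As a minimal sketch of the three-sigma check in Python (the function name, the `k` parameter, and the sample readings are my own assumptions for illustration; I use the population standard deviation, but the sample version would work too):

```python
import statistics


def three_sigma_outliers(data, k=3.0):
    """Return the values lying more than k standard deviations from the mean."""
    mu = statistics.mean(data)
    # Population standard deviation; statistics.stdev() gives the sample version.
    std = statistics.pstdev(data)
    return [x for x in data if abs(x - mu) > k * std]


# Hypothetical readings: nineteen values near 10 and one obvious spike.
readings = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
            11, 9, 10, 10, 12, 9, 10, 11, 10, 100]
print(three_sigma_outliers(readings))  # [100]
```

Note that on very small samples the outlier itself inflates the standard deviation, so `three_sigma_outliers([1, 2, 3, 4, 100])` flags nothing.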

An alternative method is the IQR outlier test:

Q25 = 25th_percentile
Q75 = 75th_percentile
IQR = Q75 - Q25         // inter-quartile range
if x < Q25 - 1.5*IQR  or  x > Q75 + 1.5*IQR  then  x is a mild outlier
if x < Q25 - 3.0*IQR  or  x > Q75 + 3.0*IQR  then  x is an extreme outlier
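A rough Python sketch of the IQR test, measuring the low side from Q25 and the high side from Q75 (the usual Tukey fences). The function name is hypothetical, and the choice of `method="inclusive"` for the quartiles is an assumption; other quantile conventions give slightly different fences:

```python
import statistics


def iqr_outliers(data):
    """Split outliers into (mild, extreme) lists using 1.5*IQR and 3.0*IQR fences."""
    q25, _median, q75 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q75 - q25
    mild, extreme = [], []
    for x in data:
        # Distance beyond the nearest quartile; negative when x is inside the box.
        beyond = max(q25 - x, x - q75)
        if beyond > 3.0 * iqr:
            extreme.append(x)
        elif beyond > 1.5 * iqr:
            mild.append(x)
    return mild, extreme


print(iqr_outliers([1, 2, 3, 4, 100]))  # ([], [100])
```

Because the quartiles barely move when a single value is extreme, this still catches the 100 in a five-point sample.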

This is the test usually used in box plots, where the whiskers mark the outlier limits.

EDIT:

For your case (simple 1D univariate data), I think my first answer is well suited. That, however, isn't applicable to multivariate data.

@smaclell suggested using K-means to find the outliers. Besides the fact that it is mainly a clustering algorithm (not really an outlier detection technique), the problem with K-means is that it requires knowing in advance a good value for the number of clusters K.

A better-suited technique is DBSCAN, a density-based clustering algorithm. Basically, it grows regions of sufficiently high density into clusters, each of which is a maximal set of density-connected points.


DBSCAN requires two parameters: epsilon and minPoints. It starts with an arbitrary point that has not been visited. It then finds all the neighbor points within distance epsilon of the starting point.

If the number of neighbors is greater than or equal to minPoints, a cluster is formed. The starting point and its neighbors are added to this cluster and the starting point is marked as visited. The algorithm then repeats the evaluation process for all the neighbors recursively.

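If the number of neighbors is less than minPoints, the point is marked as noise. (It may later be absorbed into a cluster if it turns out to lie within epsilon of one of that cluster's dense points.)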

If a cluster is fully expanded (all points within reach are visited) then the algorithm proceeds to iterate through the remaining unvisited points until they are depleted.

Finally, the points still marked as noise are considered outliers.
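The procedure described above could be sketched for 1D data roughly as follows. This is a plain-Python illustration, not a tuned implementation: the function name and the sample values are made up, the neighbor search is a brute-force scan, and I am assuming a point counts as its own neighbor when comparing against minPoints.

```python
def dbscan_1d(points, eps, min_points):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Indices of all points within eps of points[i], including i itself.
        return [j for j, p in enumerate(points) if abs(p - points[i]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_points:
            labels[i] = NOISE  # provisional; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not dense itself
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_points:
                queue.extend(reachable)  # j is dense, so keep expanding from it
    return labels


values = [1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 20.0]
labels = dbscan_1d(values, eps=0.5, min_points=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
print([v for v, lab in zip(values, labels) if lab == -1])  # [20.0]
```

The two dense groups become clusters 0 and 1 without K being specified up front, and the isolated 20.0 is left as noise.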

Amro 2010-02-20 20:21:01

COOL! Thank you for your wonderful answer and explanations.

smaclell 2010-02-21 02:06:17

+1 three-sigma and IQR look like good techniques, thanks for the insightful answer.

Grundlefleck 2010-02-21 09:48:44

I like this simple advice. The IQR-based statistic has the advantage of not being influenced by extreme outliers, which distort the mean/sd.

Tristan 2010-02-21 18:52:57

Sometimes the simplest solution is as robust as a complicated one.

Amro 2010-02-21 19:19:45

Answer 2

+2 A:

There are a variety of clustering techniques you could use to try to identify central tendencies within your data. One such algorithm we used heavily in my pattern recognition course was K-Means. This would allow you to identify whether there is more than one related set of data, as in a bimodal distribution. It does require some knowledge of how many clusters to expect, but it is fairly efficient and easy to implement.

After you have the means, you could then check whether any point is far from all of them. You can define 'far' however you want, but I would recommend the suggestions by @Amro as a good starting point.
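To make the idea concrete, here is a small sketch of 1D K-means (Lloyd's algorithm) followed by a distance-to-nearest-mean score. Everything here is illustrative: the function names, the evenly spaced initialization, and the sample data are assumptions, and the threshold for 'far' is deliberately left to the caller.

```python
def kmeans_1d(data, k, max_iter=100):
    """Lloyd's algorithm on scalars; returns (centers, labels)."""
    lo, hi = min(data), max(data)
    # Assumed initialization: k centers spread evenly between min and max.
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda c: abs(x - centers[c])) for x in data]
        new_centers = []
        for c in range(k):
            members = [x for x, lab in zip(data, labels) if lab == c]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return centers, labels


def distance_to_nearest(x, centers):
    """How far x lies from the closest cluster mean."""
    return min(abs(x - c) for c in centers)


# Hypothetical bimodal data with one value (5.5) sitting between the modes.
data = [1.0, 1.2, 0.8, 1.1, 10.0, 10.3, 9.9, 5.5]
centers, labels = kmeans_1d(data, k=2)
most_suspicious = max(data, key=lambda x: distance_to_nearest(x, centers))
print(most_suspicious)  # 5.5
```

Note that the in-between point still gets assigned to a cluster and pulls that cluster's mean toward itself, which is one reason a follow-up test on the distances is needed.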

For a more in-depth discussion of clustering algorithms, refer to the Wikipedia entry on clustering.

smaclell 2010-02-20 20:24:12

Agreed. K-Means is a simple, effective, and adaptive solution for this problem. Create two clusters, initialize properly, and one of the clusters should contain the meaningful data while the other gets the outlier(s). But be careful; if you have no outliers, then both clusters will contain meaningful data.

Steve 2010-02-20 20:43:01

Well, that is where it gets fun. It is often very difficult to determine the number of clusters, and it would be even harder to do in a live system. Even in the case of one true cluster and another outlier cluster, it could be argued that the outliers are starting to represent a real mode for the data. I am going to add more links to provide other options.

smaclell 2010-02-20 21:15:12

This strikes me as the wrong tool for the job. He's primarily interested in fat tails, not bimodal distributions.

Tristan 2010-02-20 21:21:52

It depends on the asker's intent, so we cannot be completely sure. If the only intent is to assess how anomalous a data point is, then use simple statistics, of course. But if you want to, say, use the "good" data as an input to a subsequent function, then there may be value in classifying the points as "good" or "bad" (e.g., through K-means, etc.).

Steve 2010-02-20 22:09:54

But algorithms like k-means offer no ability for the user to define what anomalous means. The kmeans clusters are simply solutions to a very specific cluster definition.

Tristan 2010-02-20 22:43:01

@Steve That is actually wrong. There is no reason why all the outliers should form a cluster. K-Means finds clusters that minimize the squared Euclidean distance of the points from their cluster centers, so if the outliers are distributed evenly around the clusters, this will not help at all. The Euclidean distance results from a Gaussian assumption, which is very vulnerable to outliers. Don't use K-Means on its own for outlier detection. You might want to use it for preprocessing and then apply three sigma afterwards, as stated by the original author.

bayer 2010-02-20 23:05:59

@Tristan I am not sure if this is the wrong tool for the job, but it is overkill for simple domains, and having to pick a K still adds complexity. I just wanted to open the door for conventional clustering methods and how they could be applied to help determine outliers. Thank you all for the discussion.

smaclell 2010-02-20 23:32:52

I'm not saying K-means is unequivocally the right tool, but by no means is it unequivocally the wrong tool, and therefore I find the statement "that is actually wrong" a gross overgeneralization. The asker says "d is clearly an anomaly, and I would want to perform a specific action based on this." From this statement, I interpret a desire to assign hard labels: outlier, or not. First, three sigma is only an *empirical* choice! Is two sigma better? Or four? Neither of us knows, because we don't know the data well enough. From the example given, I only see nonnegative data with ...

Steve 2010-02-21 01:36:12

... the only outlier being >> 0. With that example, for most intelligent initializations of two cluster centers, K-means will correctly cluster (a,b,c,e) together; this can easily be verified by hand. Of course, if you also have negative data, the outliers will not form a single cluster; I'm absolutely aware of that. But we don't know what the data is like. We don't know if it is uni- or multi-modal, how Gaussian it is or isn't, or what its support is. But I do know that there *exist* cases, even in one dimension, where K-means *will work*.

Steve 2010-02-21 01:41:08

+1 for the suggestion and the insight in the comments.

Grundlefleck 2010-02-21 09:46:47
